Overview and Motivation

The landscape of Hollywood films is changing rapidly. Over the generations, films have evolved and now show great diversity in themes, genres, actors, directors, runtime, etc. Themes that were relevant in the past are often absent from today’s films. Even within the same genre, the emphasis on various features has evolved to accommodate modern expectations. As an example, consider Marvel Comics based films from the pre-2000 and post-2000 eras. Prominent pre-2000 Marvel “classics” include Howard the Duck (1986) and Captain America (1990, direct to video). After 2000, instead of going down the corny, campy route, Marvel revamped its story lines, hired serious directors and better actors, and switched to high production values. Needless to say, their formula worked well.

Anyone reasonably familiar with today’s actors, directors and production companies tends to treat these three factors as the main drivers of a movie’s success. For instance, is it really surprising that a Daniel Day-Lewis film, a Christopher Nolan directed film or a Disney Pixar production succeeded at the box office? Not really.

However, the movie landscape is incredibly convoluted and diverse. Not all successful films have a combination of good actors, experienced directors and big-budget production companies. Therefore, we wanted to formally investigate all features (aside from reviews) that can predict a movie’s success.

Our goal for this project is twofold: 1. To view how the film landscape has changed in the past few decades, and 2. To identify key features that are predictive of any given movie’s success.

Overall, we ask : Is it possible to find a formula for success of any Hollywood film or even a film concept?


Initial Questions

  1. How has people’s taste in genres and themes changed in the past few decades? What genres are important today?

  2. What genres generate the most profit?

  3. Can we codify any actor, director or production company’s influence by giving each a score?

  4. Does spending money on one principal actor bring in lots of profit?

  5. After assigning scores, can we use the features available for a movie (excluding ratings), namely actor (score), director (score), production company (score), runtime, genre and themes, to predict a movie’s success?

  6. Do movies in the different budget classes (for example high vs. low budget) have the same predictors? In other words, do we have confounding from a movie’s budget? Do we have to stratify our data?


Data: Source, scraping method, cleanup, etc.

Source

We tried IMDB first, which comprehensively stores most of the relevant information. Unfortunately, IMDB restricts downloads to information on at most 1000 movies. Furthermore, no summary of total revenue (which was used to assign actor scores) was readily available.

Subsequently, we used TMDB, which provides a free and user-friendly API. We cross-checked some of the information to make sure that the data was accurate, and it was. TMDB provides information on the 5 main actors, director, budget, revenue, genres, themes and release date. Information on earlier movies (from the 1940s and 1950s) is scarcer and often lacks budget and revenue.

Scraping method

We used a Python script to scrape all data from the TMDB API.

Cleanup

Data tables

We have provided 4 tables: the raw data table, the modified (processed) data table, the additional information table, and the upcoming movies table.

Reading in raw data and processed data

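# Packages used in this section
library(dplyr)      # data manipulation and %>%
library(tidyr)      # gather()
library(ggplot2)    # plots
library(lubridate)  # parse_date_time(), month(), year()
# Raw Data
raw_data <- read.csv("data_raw.csv") %>% tbl_df %>% select(-X)
# Cleaned data directly from CSV
data <- read.csv("movies_3.csv") %>% tbl_df %>% select(-X)
# Parse the release date and derive release month and year
data <- data %>% mutate(date=parse_date_time(releaseDate,"mdy"))
data <- data %>% mutate(m=month(date), releaseYear = year(date))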

The raw data table consists of information on all movies from 1920 onwards. We used the raw data to visualize trends on budget, genre, profit etc.

The processed data table is derived from the raw data table and contains clean and complete information on movies from 1987 onwards. We used this dataset to set actor and director scores and to perform all subsequent data analysis and prediction.

The additional information table consists of information on movie themes and keywords, which was used to evaluate trends in themes and to flag movies as belonging (or not) to the “saving the world” or “superhero” category.

The upcoming movies table is used to predict the profit of films released in 2016.

Production company

We standardized the names of the top production companies by combining all the variations of a company’s name (such as “Fox” => “20th Century Fox”) and folded subsidiary companies into their parent company (such as “Touchstone” => “Walt Disney”). Finally, we added the total number of films under each production company.

# Standardize production company name
data <- data %>% 
  mutate(production= gsub(".*Fox.*", "20th Century Fox", production)) %>%
  mutate(production= gsub(".*Alliance.*", "Alliance", production)) %>%
  mutate(production= gsub(".*BBC.*", "BBC", production)) %>%
  mutate(production= gsub(".*Universal.*", "Universal Pictures", production)) %>%
  mutate(production= gsub(".*Paramount.*", "Paramount Film", production)) %>%
  mutate(production= gsub(".*Columbia.*", "Columbia Pictures", production)) %>%
  mutate(production= gsub(".*Disney.*", "Walt Disney", production)) %>%
  mutate(production= gsub(".*DreamWorks.*", "DreamWorks", production)) %>%
  mutate(production= gsub(".*Warner.*", "Warner Bros", production)) %>%
  mutate(production= gsub(".*Summit.*", "Summit Entertainment", production)) %>%
  mutate(production= gsub(".*Lions.*", "Lions", production)) %>%
  mutate(production= gsub(".*Ingenious.*", "Ingenious", production)) %>%
  mutate(production= gsub(".*Regency.*", "Regency", production)) %>%
  mutate(production= gsub(".*Sony.*", "Sony", production)) %>%
  mutate(production= gsub(".*Canal.*", "Canal", production)) %>%
  mutate(production= gsub(".*France.*", "France", production)) %>%
  mutate(production= gsub(".*Gems.*", "Sony", production)) %>%
  mutate(production= gsub(".*Marvel.*", "Walt Disney", production))%>%
  mutate(production= gsub(".*Touchstone.*", "Walt Disney", production)) %>%
  mutate(production= gsub(".*Dimension.*", "The Weinstein Company", production))  %>%
  mutate(production= gsub(".*TriStar.*", "Sony", production))   %>%
  mutate(production= gsub(".*DC.*", "Warner Bros", production))   %>%
  mutate(production= gsub(".*Castle Rock.*", "Warner Bros", production))  %>%
  mutate(production= gsub(".*Caravan Pictures.*", "Spyglass Entertainment", production))  %>%
  mutate(production= gsub(".*United Artists.*", "MGM", production)) %>%
  mutate(production= gsub(".*MGM.*", "MGM", production)) %>%
  mutate(production= gsub(".*Legendary Pictures.*", "Warner Bros", production)) %>%
  mutate(production= gsub(".*Destination Films.*", "Sony", production)) %>%
  mutate(production= gsub(".*Rogue Pictures.*", "Relativity Media", production)) %>%
  mutate(production= gsub(".*Fine Line Features.*", "New Line Cinema", production)) %>%
  mutate(production= gsub(".*Hollywood Pictures.*", "Walt Disney", production)) %>%
  mutate(production= gsub(".*Channel Four Films.*", "Film4", production)) %>%
  mutate(production= gsub(".*Film 4.*", "Film4", production)) %>%
  mutate(production= gsub(".*Artisan Entertainment.*", "Lions", production)) %>%
  mutate(production= gsub(".*Lucasfilm.*", "Walt Disney", production)) %>%
  mutate(production= gsub(".*Working Title Films.*", "Universal Pictures", production)) %>%
  mutate(production= gsub(".*Revolution.*", "Revolution", production)) %>%
  mutate(production= gsub(".*Focus Features.*", "Universal Pictures", production)) %>%
  mutate(production= gsub(".*Silver Pictures.*", "Warner Bros", production)) %>%
  mutate(production= gsub(".*Blue Sky Studios.*", "Warner Bros", production)) %>%
  mutate(USA=ifelse(country=="United States of America",1,0)) %>%
  select(-country)
# Total number of movies in each production company
data <- data%>% 
  group_by(production) %>%
  mutate(s_production=n()) %>%
  ungroup()
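
The long chain of gsub calls above can also be written as a lookup table that is applied in order. The following is a minimal sketch in base R, not the code used for the analysis; `studio_rules`, `standardize_production` and the toy company names are hypothetical and cover only a handful of the rules.

```r
# Hypothetical lookup table: regex pattern -> canonical studio name.
# Order matters: each rule is applied to the result of the previous one.
studio_rules <- c(
  ".*Fox.*"        = "20th Century Fox",
  ".*Disney.*"     = "Walt Disney",
  ".*Touchstone.*" = "Walt Disney",
  ".*Warner.*"     = "Warner Bros"
)

standardize_production <- function(x, rules = studio_rules) {
  x <- as.character(x)
  for (pattern in names(rules)) {
    # A match replaces the whole string, because the pattern is wrapped in .*
    x <- gsub(pattern, rules[[pattern]], x)
  }
  x
}

# Toy input (made-up spellings, for illustration only)
toy <- c("Fox Searchlight Pictures", "Touchstone Pictures",
         "Warner Bros. Pictures", "Miramax Films")
standardized <- standardize_production(toy)
print(standardized)
```

Keeping the rules in one named vector makes it easier to review which subsidiary is folded into which parent, and names that match no rule pass through unchanged.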

Genre

NOTE : This was performed on our raw data table ( variable = raw_data ).

We separated the single column of genres (in list form) into separate genre columns and assigned True/False in each genre category for each movie.

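# Parse release dates and extract the release year
movies<-raw_data%>%mutate(date=parse_date_time(releaseDate,"mdy"))
movies<-movies%>%mutate(year=as.numeric(year(date)))
# Years after 2015 are assumed to be two-digit years parsed into the wrong century; shift them back by 100
movies <- movies %>% mutate(year=ifelse(year>2015,year-100,year))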
genre_list<-c(
  'Action','Adventure','Animation','Comedy','Crime','Documentary','Drama','Family','Fantasy','Foreign','History','Horror','Music','Mystery','Romance','ScienceFiction','Thriller','War','Western')
head(movies$genres)
## [1] ['Animation', 'Comedy', 'Family']       
## [2] ['Adventure', 'Fantasy', 'Family']      
## [3] ['Romance', 'Comedy']                   
## [4] ['Comedy', 'Drama', 'Romance']          
## [5] ['Comedy']                              
## [6] ['Action', 'Crime', 'Drama', 'Thriller']
## 1790 Levels: ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime'] ...
# Function to separate genres
fun<-function(x){
  grepl(x,movies$genres)
}
# Apply function
tmp<-sapply(genre_list,fun)
movie_genre<-cbind(movies,tmp)
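
As a small illustration of the genre separation, the sketch below builds the True/False columns on made-up genre strings. It assumes the raw strings spell the label as 'Science Fiction' (with a space) and matches each quoted label literally; `genres_raw` and `genre_labels` are hypothetical names, and this is not the exact code used above.

```r
# Toy genre strings in the same list-like form as movies$genres (made up).
genres_raw <- c("['Action', 'Science Fiction']",
                "['Comedy']",
                "['Drama', 'Action']")

# Labels as they are assumed to appear in the raw strings.
genre_labels <- c("Action", "Comedy", "Drama", "Science Fiction")

# Match the quoted label literally (fixed = TRUE), so no character is
# interpreted as a regular-expression operator.
genre_flags <- sapply(genre_labels, function(g) {
  grepl(paste0("'", g, "'"), genres_raw, fixed = TRUE)
})

# Column names without spaces are easier to use in later code.
colnames(genre_flags) <- gsub(" ", "", colnames(genre_flags))
print(colSums(genre_flags))

# A pattern without the space finds nothing in these strings.
no_space_hits <- sum(grepl("ScienceFiction", genres_raw))
```

In this toy input, searching for 'ScienceFiction' returns no matches, while the label with the space is found once.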

Exploratory Analysis

Trend Analysis on raw data

People’s taste

# Total movies in each genre in the raw data set
movie_genre%>%select(Action:Western)%>%
  apply(2,sum)
##         Action      Adventure      Animation         Comedy          Crime 
##           2079           1243            380           3223           1455 
##    Documentary          Drama         Family        Fantasy        Foreign 
##            185           4984            774            739            224 
##        History         Horror          Music        Mystery        Romance 
##            387            954            364            798           1938 
## ScienceFiction       Thriller            War        Western 
##              0           2451            346            225

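The largest categories are Drama, Comedy, Thriller, Action, Romance, Crime and Adventure. The ScienceFiction tally of 0 does not mean that there are no science fiction movies: most likely the pattern 'ScienceFiction' (without a space) never matches the label stored in the genre strings, which TMDB writes as 'Science Fiction', so these movies are only counted under their other genres.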

t1<-movie_genre%>%group_by(year)%>%
  summarize(p_Action=sum(Action)/n())
t2<-movie_genre%>%group_by(year)%>%
  summarize(p_Adventure=sum(Adventure)/n())
t3<-movie_genre%>%group_by(year)%>%
  summarize(p_Comedy=sum(Comedy)/n())
t4<-movie_genre%>%group_by(year)%>%
  summarize(p_Crime=sum(Crime)/n())
t5<-movie_genre%>%group_by(year)%>%
  summarize(p_Romance=sum(Romance)/n())
t6<-movie_genre%>%group_by(year)%>%
  summarize(p_Thriller=sum(Thriller)/n())
t<-t1%>%full_join(t2)%>%full_join(t3)%>%full_join(t4)%>%full_join(t5)%>%full_join(t6)
## Joining by: "year"
## Joining by: "year"
## Joining by: "year"
## Joining by: "year"
## Joining by: "year"
# Plots
t%>%gather(key=genre_percent,value=percentage,-year)%>%
  ggplot(aes(x=year,y=percentage,col=genre_percent))+
  geom_point()+geom_smooth(se=FALSE)+ggtitle("Relative composition of each genre")

t%>%gather(key=genre_percent,value=percentage,-year)%>%
  filter(year>1995)%>%
  ggplot(aes(x=year,y=percentage,col=genre_percent))+
  geom_point()+geom_smooth(se=FALSE)+ggtitle("Relative composition of each genre after 1995")

tmp<-t%>%gather(key=genre_percent,value=percentage,-year)
tmp<-tmp%>%mutate(decade=floor(year/10)*10)
p<- tmp%>%ggplot(aes(year,percentage,frame=decade))+geom_point()+geom_smooth(se=FALSE,aes(frame=decade))+facet_wrap(~genre_percent)
#gg_animate(p,"p1.gif")
#![p1](p1.gif)

The above plots describe how people’s taste has changed through the years. In the post-1995 graph, we can see a clear drop in the share of comedies and romance, in favor of action and adventure.
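
The per-year genre shares plotted above are simply the mean of each True/False genre column within a year. A minimal base R sketch on made-up data, shown here as an alternative to building one table per genre and joining them (`toy` and `shares` are hypothetical names):

```r
# Toy data: one row per movie, logical genre flags (values are made up).
toy <- data.frame(
  year   = c(2000, 2000, 2001, 2001),
  Action = c(TRUE, FALSE, TRUE, TRUE),
  Comedy = c(TRUE, TRUE, FALSE, FALSE)
)

# The mean of a logical column is the share of TRUE values,
# i.e. sum(flag) / number of movies, computed here within each year.
shares <- aggregate(toy[c("Action", "Comedy")],
                    by = list(year = toy$year), FUN = mean)
print(shares)
```

Because a movie can carry several genres, the shares within a year do not need to add up to 1.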

Profit from each movie category

t1<-movie_genre%>%group_by(year)%>%
  filter(Action==TRUE)%>%
  summarize(r_Action=mean(revenue,na.rm=TRUE))
t2<-movie_genre%>%group_by(year)%>%
  filter(Adventure==TRUE)%>%
  summarize(r_Adventure=mean(revenue,na.rm=TRUE))
t3<-movie_genre%>%group_by(year)%>%
  filter(Comedy==TRUE)%>%
  summarize(r_Comedy=mean(revenue,na.rm=TRUE))
t4<-movie_genre%>%group_by(year)%>%
  filter(Crime==TRUE)%>%
  summarize(r_Crime=mean(revenue,na.rm=TRUE))
t5<-movie_genre%>%group_by(year)%>%
  filter(Romance==TRUE)%>%
  summarize(r_Romance=mean(revenue,na.rm=TRUE))
t6<-movie_genre%>%group_by(year)%>%
  filter(Thriller==TRUE)%>%
  summarize(r_Thriller=mean(revenue,na.rm=TRUE))
t<-t1%>%full_join(t2)%>%full_join(t3)%>%full_join(t4)%>%full_join(t5)%>%full_join(t6)
## Joining by: "year"
## Joining by: "year"
## Joining by: "year"
## Joining by: "year"
## Joining by: "year"
# Plots
t%>%gather(key=genre_revenue,value=average_revenue,-year)%>%
  ggplot(aes(x=year,y=average_revenue,col=genre_revenue))+
  geom_point()+geom_smooth(se=FALSE)+ggtitle("Average movie revenue of movie genre")

t%>%gather(key=genre_revenue,value=average_revenue,-year)%>%
  filter(year>1995)%>%
  ggplot(aes(x=year,y=average_revenue,col=genre_revenue))+
  geom_point()+geom_smooth(se=FALSE)+ggtitle("Average movie revenue of movie genre after 1995")

tmp<-t%>%gather(key=genre_revenue,value=average_revenue,-year)%>%
  filter(year>1995)%>%mutate(decade=floor(year/5)*5)
p<- tmp%>%ggplot(aes(year,average_revenue))+geom_point()+geom_smooth(se=FALSE,aes(frame=decade,group=decade))+facet_wrap(~genre_revenue)
#gg_animate(p,"p2.gif")
#![p2](p2.gif)

We see a large increase in movie revenue across all genres over the years, as one would expect. Interestingly, the revenue of adventure films has risen especially sharply since 1995. This could be due to the increasing popularity of film adaptations of comic books (superhero films) and of adventure/fantasy books (the Lord of the Rings and Harry Potter film series) after the 1990s.

Themes

We wanted to identify general themes that might be popular choices for successful movies.

NOTE : We used the updated dataset (variable = data) for this analysis

movies<- data
movies<-movies%>%mutate(year=ifelse(year>2015,year-100,year))
addition<-read.csv("movies_aditionalinfo.csv")
ad<-movies%>%left_join(addition,by="TMDBID")

# Get relevant keywords
words<-ad%>%select(year,TMDBID,revenue,num_rating,keyword1,keyword2,keyword3)%>%
  gather(key=rank,value=keyword,-c(year,TMDBID,num_rating,revenue))
words<-words%>%filter(! keyword %in% stop_words$word)

# Popularity of themes between 1995-2015
words%>%filter(!is.na(keyword))%>%
  count(keyword,sort=TRUE)%>%
  filter(n>20)%>%
  mutate(word=reorder(keyword,n))%>%
  ggplot(aes(word,n))+geom_bar(stat="identity")+coord_flip()

# Popularity of Themes between 1995-2005
words%>%filter(year<2005,!is.na(keyword))%>%
  count(keyword,sort=TRUE)%>%
  filter(n>10)%>%
  mutate(word=reorder(keyword,n))%>%
  ggplot(aes(word,n))+geom_bar(stat="identity")+coord_flip()

# Popularity of Themes between 2005-2015
words%>%filter(year>=2005,!is.na(keyword))%>%
  count(keyword,sort=TRUE)%>%
  filter(n>10)%>%
  mutate(word=reorder(keyword,n))%>%
  ggplot(aes(word,n))+geom_bar(stat="identity")+coord_flip()

## Word Cloud
pal <- brewer.pal(9,"BuGn")
pal <- pal[-(1:4)]
# Popularity of themes between 1995-2015
common<-words%>%filter(!is.na(keyword))%>%
  count(keyword,sort=TRUE)
wordcloud(common$keyword,common$n,min.freq =18,scale=c(3,.5),random.order=TRUE, colors=pal)

# Popularity of Themes between 1995-2005
common<-words%>%filter(year<2005,!is.na(keyword))%>%
  count(keyword,sort=TRUE)
wordcloud(common$keyword,common$n,min.freq = 8,scale=c(3,.5),random.order=TRUE, colors=pal)

# Popularity of Themes between 2005-2015
common<-words%>%filter(year>=2005,!is.na(keyword))%>%
  count(keyword,sort=TRUE)
wordcloud(common$keyword,common$n,min.freq = 8,scale=c(3,.5),random.order=TRUE, colors=pal)

We managed to identify some popular themes, such as “based on novel”, “new york”, “dystopia”, “superhero”, “saving the world”, “murder”, “sport” and “prison”. Novel adaptations have been popular over the last two decades, and their popularity skyrocketed in the last decade. Hence, our earlier intuition that the increased popularity of action and adventure films is due to the rising popularity of novel or book adaptations does not seem far-fetched.

ave=median(movies$revenue)
ave
## [1] 54678386
k<-"based on novel"
t<-words%>%filter(keyword==k)
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-data_frame(keywords=k,ratio_against_median=ratio)
k<-"new york"
t<-words%>%filter(keyword==k)
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-bind_rows(plot_table,data_frame(keywords=k,ratio_against_median=ratio))

k<-"dystopia"
t<-words%>%filter(keyword==k)
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-bind_rows(plot_table,data_frame(keywords=k,ratio_against_median=ratio))

k<-"superhero"
t<-words%>%filter(keyword==k|keyword=="superhero team")
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-bind_rows(plot_table,data_frame(keywords=k,ratio_against_median=ratio))

k<-"saving the world"
t<-words%>%filter(keyword==k)
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-bind_rows(plot_table,data_frame(keywords=k,ratio_against_median=ratio))

k<-"murder"
t<-words%>%filter(keyword==k)
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-bind_rows(plot_table,data_frame(keywords=k,ratio_against_median=ratio))

k<-"sport"
t<-words%>%filter(keyword==k)
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-bind_rows(plot_table,data_frame(keywords=k,ratio_against_median=ratio))

k<-"prison"
t<-words%>%filter(keyword==k)
ratio<-mean(as.numeric(t$revenue)/ave,na.rm=TRUE)
plot_table<-bind_rows(plot_table,data_frame(keywords=k,ratio_against_median=ratio))
plot_table%>%kable
keywords            ratio_against_median
based on novel                 2.8221288
new york                       1.9204060
dystopia                       2.6968427
superhero                      3.4275342
saving the world               6.7525438
murder                         0.7843119
sport                          1.4021722
prison                         1.5334560
plot_table%>%ggplot(aes(x=keywords,y=ratio_against_median))+geom_bar(stat="identity")+
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))

The table and bar plot above show the ratio between the average revenue of movies with a particular theme and the median revenue of all movies between 1995 and 2015. Looking more closely at the “superhero” and “saving the world” categories, those movies earn roughly 3.4 and 6.8 times the revenue of the median movie, respectively.
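
The ratio is the same calculation repeated for each keyword, so it can be sketched as a single loop over a vector of keywords. The example below uses made-up revenues; `words_toy`, `median_revenue` and `keywords_of_interest` are hypothetical stand-ins for the objects used above.

```r
# Toy long table: one row per (movie, keyword) pair; numbers are made up.
words_toy <- data.frame(
  keyword = c("superhero", "superhero", "murder", "prison"),
  revenue = c(300, 500, 50, 150),
  stringsAsFactors = FALSE
)
median_revenue <- 100  # stands in for median(movies$revenue)

keywords_of_interest <- c("superhero", "murder", "prison")

# For each keyword: mean over its movies of (revenue / median revenue).
ratios <- sapply(keywords_of_interest, function(k) {
  rows <- words_toy$keyword == k
  mean(words_toy$revenue[rows] / median_revenue, na.rm = TRUE)
})

plot_table_toy <- data.frame(keywords = names(ratios),
                             ratio_against_median = unname(ratios))
print(plot_table_toy)
```

A ratio above 1 means that movies with the keyword earn, on average, more than the median movie.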

Production Company’s preference

We looked at the number of movies in the 6 major genres (Action, Adventure, Comedy, Drama, Romance and Thriller) made after 1987 by the top 10 production companies, to get a sense of each top production company’s preference for a particular genre.

prefer=data %>% group_by(production) %>%
  summarize(Action=sum(Action),
            Adventure=sum(Adventure),
            Animation=sum(Animation),
            Comedy=sum(Comedy),
            Crime=sum(Crime),
            Documentary=sum(Documentary),
            Drama=sum(Drama),
            Family=sum(Family),
            Fantasy=sum(Fantasy),
            History=sum(History),
            Horror=sum(Horror),
            Music=sum(Music),
            Mystery=sum(Mystery),
            Romance=sum(Romance),
            Thriller=sum(Thriller),
            War=sum(War),
            Western=sum(Western),
            s_production=mean(s_production),
            ratio=sum(revenue)/sum(as.double(budget)))

# Order companies by number of movies (s_production), largest first
prefer<-prefer[order(-prefer$s_production),]

# Top 10 companies
prefer%>%slice(1:10)
## Source: local data frame [10 x 20]
## 
##            production Action Adventure Animation Comedy Crime Documentary
##                 (chr)  (int)     (int)     (int)  (int) (int)       (int)
## 1  Universal Pictures     87        58        10    103    42           0
## 2      Paramount Film     81        61         5     62    39           1
## 3    20th Century Fox     63        38        10     97    30           0
## 4   Columbia Pictures     65        40         8     83    38           0
## 5         Walt Disney     51        72        49     79     5           0
## 6     New Line Cinema     35        18         0     68    31           0
## 7         Warner Bros     40        36        16     48    30           1
## 8                Sony     22        16         3     26    13           1
## 9       Miramax Films     12         2         0     32    20           0
## 10         DreamWorks     17        19        21     29     5           0
## Variables not shown: Drama (int), Family (int), Fantasy (int), History
##   (int), Horror (int), Music (int), Mystery (int), Romance (int), Thriller
##   (int), War (int), Western (int), s_production (dbl), ratio (dbl)
dat_p<-prefer %>% slice(1:10)%>%
  select(production,Action,Adventure,Comedy,Romance,Drama,Thriller)
dat_p<-dat_p%>%gather(key=type,value=number,-production)

# Plotting

ggplot(data = dat_p, aes(x = production, y = number, fill = type)) +geom_bar(stat="identity")+theme(axis.text.x  = element_text(angle=90, vjust=0.5))

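# Preferred genre of each company = genre column with the highest movie count
# (ratio is dropped as well, so that only genre counts are compared)
prod=prefer$production
prefer1= prefer%>% select(-production,-s_production,-ratio)
genre_prefer=colnames(prefer1)[apply(prefer1,1,which.max)]
prefer_genre=data.frame(prod,genre_prefer)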

No single major genre stands out as a speciality for any of the top production companies. We will use s_production, the number of movies made by a company, as our score for a production company.
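
Picking each company's most frequent genre can be illustrated with which.max applied row by row. This is a sketch on made-up counts; the studio names are hypothetical, and the non-genre column `ratio` is deliberately left out of the comparison.

```r
# Toy genre counts per company (made-up numbers and names).
counts <- data.frame(
  production = c("Studio A", "Studio B"),
  Action = c(5, 2),
  Comedy = c(3, 9),
  Drama  = c(4, 1),
  ratio  = c(2.5, 3.1),  # revenue / budget, not a genre count
  stringsAsFactors = FALSE
)

# Restrict the comparison to genre columns only.
genre_cols <- c("Action", "Comedy", "Drama")

# which.max returns the position of the largest count in each row.
counts$preferred <- genre_cols[apply(counts[genre_cols], 1, which.max)]
print(counts[c("production", "preferred")])
```

which.max returns the first maximum, so ties are resolved in favor of the genre that comes first in `genre_cols`.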

Assigning Score

Actors

We separated the single column of actors (in list form) into 5 separate columns, one for each actor. The actor score estimates the actor’s potential to bring in the “big bucks”; as computed below, it is built from the budgets of the actor’s movies. We calculated the average budget of all movies for each year and the budget proportion of every movie (its budget divided by that yearly average, scaled by 10). The score for every actor is the sum of the budget proportions of the actor’s movies, multiplied by the factor (n + 2)/n, where n is the number of movies the actor appeared in. We also calculated individual genre scores to gain a sense of the actor’s preferred genre.
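
To make the scoring rule concrete, here is a small worked example in base R on made-up movies and actors. It follows the description above (yearly budget proportion times 10, summed per actor, multiplied by (n + 2)/n); all names and numbers are hypothetical.

```r
# Toy movie table (made-up budgets).
movies_toy <- data.frame(
  title  = c("M1", "M2", "M3"),
  year   = c(2000, 2000, 2001),
  budget = c(150, 50, 80),
  stringsAsFactors = FALSE
)

# Budget proportion: budget relative to the mean budget of that year, times 10.
# ave() returns the group mean for every row.
movies_toy$budget_p <-
  movies_toy$budget / ave(movies_toy$budget, movies_toy$year) * 10

# Toy credits: one row per (actor, movie) pair.
credits <- data.frame(
  name  = c("Actor A", "Actor A", "Actor B"),
  title = c("M1", "M3", "M2"),
  stringsAsFactors = FALSE
)
credits <- merge(credits, movies_toy, by = "title")

# Score = sum of budget proportions * (n + 2) / n, n = number of credits.
score_one <- function(p) {
  n <- length(p)
  sum(p) * (n + 2) / n
}
actor_score <- tapply(credits$budget_p, credits$name, score_one)
print(actor_score)
```

In this toy case Actor A scores (15 + 10) * 4/2 = 50 and Actor B scores 5 * 3/1 = 15. The factor (n + 2)/n equals 3 for a single credit and approaches 1 as the number of credits grows.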

# Calculate Average Budget of movies per year from 1996 onwards
dat <- data %>% 
  filter( releaseYear >= 1996) %>%
  group_by(year) %>% 
  mutate(year_bud_ave=mean(budget,na.rm=TRUE))

# Budget proportion of each movie = 10 * budget of movie / mean budget of that year
dat <-dat %>% 
  mutate(budget_p=budget/year_bud_ave*10)

# Actor score
wide_actors <- dat %>% select(TMDBID, title, rating, star1:star5, Action, Adventure, Comedy, Drama, Family, Fantasy, Horror, Mystery, Thriller,budget_p)
long_actors <- wide_actors %>% gather(key = star, value = name, -c(TMDBID, title, rating, Action, Adventure, Comedy, Drama, Family, Fantasy, Horror, Mystery, Thriller,year,budget_p))

t_actors <- long_actors %>% mutate(Action = ifelse(Action==TRUE,budget_p,0))%>%
  mutate(Adventure=ifelse(Adventure==TRUE,budget_p,0))%>%
  mutate(Comedy=ifelse(Comedy==TRUE,budget_p,0))%>%
  mutate(Drama=ifelse(Drama==TRUE,budget_p,0))%>%
  mutate(Family=ifelse(Family==TRUE,budget_p,0))%>%
  mutate(Fantasy=ifelse(Fantasy==TRUE,budget_p,0))%>%
  mutate(Horror=ifelse(Horror==TRUE,budget_p,0))%>%
  mutate(Mystery=ifelse(Mystery==TRUE,budget_p,0))%>%
  mutate(Thriller=ifelse(Thriller==TRUE,budget_p,0))

## Score by Genre
score_actors <-t_actors %>% group_by(name)%>%
  summarize(s_Action=sum(Action),s_Adventure=sum(Adventure),s_Comedy=sum(Comedy),s_Drama=sum(Drama),s_Family=sum(Family),s_Fantasy=sum(Fantasy),s_Horror=sum(Horror),s_Mystery=sum(Mystery),s_Thriller=sum(Thriller))

## Overall Score
actor_score_f<-t_actors %>% group_by(name)%>%
  summarize(a_n=n(), a_score=sum(budget_p)*((a_n+2)/a_n))
#write_csv(actor_score_f %>% select(-a_n), "actor_score_f.csv")

# Exploratory Data analysis : Top 10 Actors and their preference
t<-actor_score_f%>%left_join(score_actors)
## Joining by: "name"
t<-t[order(-t$a_score),]
t%>%slice(1:10)
## Source: local data frame [10 x 12]
## 
##                 name   a_n  a_score s_Action s_Adventure  s_Comedy
##                (chr) (int)    (dbl)    (dbl)       (dbl)     (dbl)
## 1        Johnny Depp    30 649.6986 273.3855   372.22880 127.86860
## 2         Will Smith    18 475.2065 337.8488   165.46347 201.25300
## 3          Brad Pitt    28 471.3797 153.4026    67.66111  80.69946
## 4  Samuel L. Jackson    35 461.3139 290.8560   184.36619  57.87178
## 5       Ian McKellen    13 458.4811 240.0343   365.12055  35.55638
## 6       Bruce Willis    32 453.0815 309.8289   138.40363  98.92104
## 7       Nicolas Cage    32 450.6173 260.8366   128.01771  57.61119
## 8       Hugh Jackman    18 445.5225 266.0511   267.37104  36.51092
## 9         Tom Cruise    19 431.3779 250.6787   176.49465  28.18133
## 10    Angelina Jolie    20 412.2026 230.4082   127.76656  54.94051
## Variables not shown: s_Drama (dbl), s_Family (dbl), s_Fantasy (dbl),
##   s_Horror (dbl), s_Mystery (dbl), s_Thriller (dbl)
dat_p<-t %>% slice(1:10) %>%
  mutate(s_Others = s_Family + s_Fantasy + s_Horror + s_Mystery) %>%
  select(name,s_Action,s_Adventure,s_Comedy,s_Drama,s_Thriller, s_Others)
dat_p<-dat_p%>%gather(key=type,value=score,-name)
ggplot(data = dat_p, aes(x = name, y = score, fill = type)) +geom_bar(stat="identity")+theme(axis.text.x  = element_text(angle=90, vjust=0.5))+ggtitle("Top 10 actors and their score in each genre")

The table lists the ranked scores of the top 10 actors together with some of the individual genre scores.

The stacked bar plot displays the absolute score (preference) of each actor in each genre. Note that the bars are not ordered by score, and the overall height of a bar does not reflect the actor’s overall score, because a movie can be classified in multiple genres.

Directors

Directors were scored similarly to actors. The director score is based on the number of ratings (not the rating itself) of the director’s movies. Our rationale is that the number of ratings a movie receives indicates its popularity among the audience, which we expect to say more about the director’s potential to direct a box office success than the rating value does.

We calculated the average number of ratings of all movies for each year and, for every movie, its number of ratings as a proportion of that yearly average. The overall score for a director is the sum of these rating proportions over the director’s movies, multiplied by the factor (n + 2)/n, where n is the number of movies the director directed. We also calculated individual genre scores to gain a sense of the director’s preferred genre.

# Average number of ratings per year and the rating proportion
dat<-dat %>% 
  group_by(year) %>%
  mutate(year_rate_ave=mean(num_rating,na.rm=TRUE))
dat<-dat%>%mutate(rating_p=num_rating/year_rate_ave*10)

long_directors <- dat %>% select(TMDBID, title, rating_p, director, Action, Adventure, Comedy, Drama, Family, Fantasy, Horror, Mystery, Thriller)

t_directors <- long_directors %>% mutate(Action = ifelse(Action==TRUE,rating_p,0))%>%
  mutate(Adventure=ifelse(Adventure==TRUE,rating_p,0))%>%
  mutate(Comedy=ifelse(Comedy==TRUE,rating_p,0))%>%
  mutate(Drama=ifelse(Drama==TRUE,rating_p,0))%>%
  mutate(Family=ifelse(Family==TRUE,rating_p,0))%>%
  mutate(Fantasy=ifelse(Fantasy==TRUE,rating_p,0))%>%
  mutate(Horror=ifelse(Horror==TRUE,rating_p,0))%>%
  mutate(Mystery=ifelse(Mystery==TRUE,rating_p,0))%>%
  mutate(Thriller=ifelse(Thriller==TRUE,rating_p,0))

# Director Score by Genre
score_director <- t_directors %>% group_by(director)%>%
  summarize(s_Action=sum(Action),s_Adventure=sum(Adventure),s_Comedy=sum(Comedy),s_Drama=sum(Drama),s_Family=sum(Family),s_Fantasy=sum(Fantasy),s_Horror=sum(Horror),s_Mystery=sum(Mystery),s_Thriller=sum(Thriller))

# Overall Score
director_score_f <- dat %>% 
  group_by(director) %>% 
  summarize(d_n = n(), d_score=sum(rating_p)*((d_n+2)/d_n))
#write_csv(director_score_f %>% select(-d_n), "director_score_f.csv")

# Exploratory Data Analysis : Top 10 directors and their preferences
t2<-director_score_f%>%left_join(score_director)
## Joining by: "director"
t2<-t2[order(-t2$d_score),]
t2 %>% slice(1:10) 
## Source: local data frame [10 x 12]
## 
##             director   d_n  d_score s_Action s_Adventure s_Comedy
##               (fctr) (int)    (dbl)    (dbl)       (dbl)    (dbl)
## 1  Christopher Nolan     7 632.3508 328.6620    42.88449  0.00000
## 2      Peter Jackson     9 597.1715 426.9339   472.83729  7.30490
## 3      James Cameron     2 507.8315 253.9157   141.05108  0.00000
## 4   Steven Spielberg    13 426.0414 132.6179   138.24774 21.77606
## 5      David Fincher     8 382.2203   0.0000     0.00000  0.00000
## 6  Quentin Tarantino     8 375.6777 228.3330     0.00000  0.00000
## 7        Michael Bay    10 369.7801 289.6038   280.64000 19.06767
## 8        David Yates     4 339.7938   0.0000   226.52921  0.00000
## 9     Gore Verbinski     9 331.0279 228.8562   244.14125 31.22040
## 10      George Lucas     3 310.5124 186.3074   186.30743  0.00000
## Variables not shown: s_Drama (dbl), s_Family (dbl), s_Fantasy (dbl),
##   s_Horror (dbl), s_Mystery (dbl), s_Thriller (dbl)
dat_p<-t2 %>% slice(1:10) %>%
  mutate(s_Others = s_Family + s_Fantasy + s_Horror + s_Mystery) %>%
  select(director,s_Action,s_Adventure,s_Comedy,s_Drama,s_Thriller,s_Others)
dat_p<-dat_p%>%gather(key=type,value=score,-director)
ggplot(data = dat_p, aes(x = director, y = score, fill = type)) +geom_bar(stat="identity")+theme(axis.text.x  = element_text(angle=90, vjust=0.5))+ ggtitle("Top 10 directors and their score in each genre")

The table lists the ranked scores of the top 10 directors together with some of the individual genre scores.

The stacked bar plot displays the absolute score (preference) of each director in each genre. Note that the bars are not ordered by score, and the overall height of a bar does not reflect the director’s overall score, because a movie can be classified in multiple genres.

Evaluate director and actor

We want to explore whether a movie with several good actors makes more money than a movie with a single good actor.
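
The idea can be sketched in two steps: look up a score for every billed actor, then compute the share of the summed scores that belongs to the first-billed star. The base R example below uses made-up names and only two star columns; `score_actors_toy`, `cast` and `first_star_share` are hypothetical, and unknown actors are assumed to score 0.

```r
# Toy score table and cast list (made-up values).
score_actors_toy <- data.frame(
  name    = c("Actor A", "Actor B"),
  a_score = c(50, 15),
  stringsAsFactors = FALSE
)
cast <- data.frame(
  TMDBID = 1:2,
  star1  = c("Actor A", "Actor X"),
  star2  = c("Actor B", "Actor Y"),
  stringsAsFactors = FALSE
)

# Look up each star's score by name; actors without a score get 0.
star_cols <- c("star1", "star2")
for (i in seq_along(star_cols)) {
  s <- score_actors_toy$a_score[match(cast[[star_cols[i]]],
                                      score_actors_toy$name)]
  s[is.na(s)] <- 0
  cast[[paste0("a_score", i)]] <- s
}

# Share of the first star; a total of 0 would give 0/0 = NaN,
# so that case is set to 0 explicitly.
total <- cast$a_score1 + cast$a_score2
cast$first_star_share <- ifelse(total > 0, cast$a_score1 / total, 0)
print(cast)
```

A share close to 1 means the first-billed star carries almost all of the cast's score, while a share near 1/5 (with five stars) means the scores are spread evenly.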

# Can also be read from the tables below
#score_actors <- read.csv("score_actor_f.txt")
#score_director <- read.csv("score_director_f.txt")

score_actors <- actor_score_f %>% select(c(name, a_score))
score_director <- director_score_f %>% 
  select(director, d_score) %>% 
  mutate(director=as.character(director))

#data_bkup -> data
# Convert join keys from factor to character, then attach the director score
data<-data%>%mutate(director=as.character(director),star1=as.character(star1),star2=as.character(star2),star3=as.character(star3),star4=as.character(star4),star5=as.character(star5))
data<-left_join(data,score_director,by="director")
## Joining by: "director"
t<-data%>%select(TMDBID,star1:star5)%>%left_join(score_actors,by=c("star1"="name"))
colnames(t)<-c("TMDBID",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1")
t<-t%>%left_join(score_actors,by=c("star2"="name"))
colnames(t)<-c("TMDBID",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2")
t<-t%>%left_join(score_actors,by=c("star3"="name"))
colnames(t)<-c("TMDBID",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2","a_score3")
t<-t%>%left_join(score_actors,by=c("star4"="name"))
colnames(t)<-c("TMDBID",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2","a_score3","a_score4")
t<-t%>%left_join(score_actors,by=c("star5"="name"))
colnames(t)<-c("TMDBID",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2","a_score3","a_score4","a_score5")
t<-t%>%mutate(a_score1=ifelse(is.na(a_score1),0,a_score1),a_score2=ifelse(is.na(a_score2),0,a_score2),a_score3=ifelse(is.na(a_score3),0,a_score3),a_score4=ifelse(is.na(a_score4),0,a_score4),a_score5=ifelse(is.na(a_score5),0,a_score5))

# Evaluate actors
data<-data%>%mutate(budget_ratio=budget/median(budget))

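# Share of the first-billed star in the summed actor scores
t<-t%>%mutate(first_star_potion=a_score1/(a_score1+a_score2+a_score3+a_score4+a_score5))
# Note: when all five scores are 0 the ratio is 0/0 = NaN rather than Inf, so the check below yields NA for those rows
t<-t%>%mutate(first_star_potion=ifelse(first_star_potion==Inf,0,first_star_potion))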
dat_star<-data%>%select(TMDBID,revenue,budget_ratio)%>%left_join(t,by="TMDBID")

dat_star%>%ggplot(aes(x=first_star_potion,y=revenue))+geom_point()

fit<-lm(revenue~first_star_potion,data = dat_star)
summary(fit)
## 
## Call:
## lm(formula = revenue ~ first_star_potion, data = dat_star)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -140123109 -104935225  -66746020   29214681 2653917116 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       140123115    6636965  21.113  < 2e-16 ***
## first_star_potion -49703181   16518065  -3.009  0.00264 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 190800000 on 2967 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.003042,   Adjusted R-squared:  0.002706 
## F-statistic: 9.054 on 1 and 2967 DF,  p-value: 0.002643
# Account for confounding from budget

fit<-lm(revenue~first_star_potion+budget_ratio,data = dat_star)
summary(fit)
## 
## Call:
## lm(formula = revenue ~ first_star_potion + budget_ratio, data = dat_star)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -680358873  -53296026  -10638514   23987047 2065080123 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5244462    5469310  -0.959    0.338    
## first_star_potion -6042851   11830538  -0.511    0.610    
## budget_ratio      85440653    1601632  53.346   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 136300000 on 2966 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.4912, Adjusted R-squared:  0.4909 
## F-statistic:  1432 on 2 and 2966 DF,  p-value: < 2.2e-16
# Join actor score to main table

t<-t%>%mutate(a_score_t=(0.4*a_score1+0.30*a_score2+0.20*a_score3+0.05*a_score4+0.05*a_score5))

data<-t%>%select(TMDBID,a_score_t,first_star_potion)%>%left_join(data,by="TMDBID")

In the first linear fit, which ignores budget, the first star's share is significantly associated with revenue (coefficient of about -49.7 million, p = 0.003), which would suggest that the money spent on the first actor influences a movie's revenue. However, we need to worry about confounding by budget, since high-budget movies earn more regardless of how much they spend on better actors. After we adjust for the budget ratio (the movie's budget divided by the median budget of all films after 1987), the effect of the first star's share disappears (p = 0.61). Hence, we need to consider stratifying on the movie's budget category (low, medium or high) for further analysis.
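The confounding argument above can be illustrated with a small simulation. This is only a sketch on made-up data, not the movie table: the variable `first_star_share` and all numeric constants are hypothetical, chosen so that budget drives both revenue and the first star's share while the share itself has no direct effect on revenue.

```r
# Simulated illustration of confounding by budget; every number here is hypothetical.
set.seed(1)
n <- 2000
log_budget   <- rnorm(n)
budget_ratio <- exp(log_budget)            # budget relative to the median budget

# Toy assumption: bigger films spread cast quality more evenly,
# so the first star's share falls as budget rises.
first_star_share <- 0.4 - 0.05 * log_budget + rnorm(n, sd = 0.1)

# Revenue depends on budget only, not on the first star's share.
revenue <- 100 * budget_ratio + rnorm(n, sd = 50)

unadjusted <- lm(revenue ~ first_star_share)
adjusted   <- lm(revenue ~ first_star_share + budget_ratio)

coef_unadj <- coef(unadjusted)[["first_star_share"]]
coef_adj   <- coef(adjusted)[["first_star_share"]]
print(c(unadjusted = coef_unadj, adjusted = coef_adj))
```

In this toy setup the unadjusted slope is strongly negative even though the share has no causal effect, and it shrinks toward zero once `budget_ratio` enters the model, which is the same pattern as in the two fits above.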

Add influential keywords

# Adding major key points to the main table (variable = data)
ad <- addition
ad<-ad%>%mutate(I_superhero=(keyword1=="superhero"|keyword2=="superhero"|keyword3=="superhero"|keyword1=="superhero team"|keyword2=="superhero team"|keyword3=="superhero team"))
ad<-ad%>%mutate(I_saving_world=(keyword1=="saving the world"|keyword2=="saving the world"|keyword3=="saving the world"))
ad<-ad%>%mutate(I_superhero=ifelse(is.na(I_superhero),FALSE,I_superhero))
ad<-ad%>%mutate(I_saving_world=ifelse(is.na(I_saving_world),FALSE,I_saving_world))
tmp<-ad%>%select(TMDBID,I_saving_world,I_superhero)
data<-data%>%left_join(tmp,by="TMDBID")
data<-data%>%mutate(profit=revenue-budget)
max(data$profit)
## [1] 2544505847
median(data$profit)
## [1] 23489268
quantile(data$profit,0.98)
##       98% 
## 605475499
data<-data%>%mutate(profit=ifelse(profit>605475499,605475499,profit  ))
data<-data%>%mutate(profit_r=profit/median(profit))

data_checkpoint1<-data

Runtime

Do longer movies garner more profit?

data %>%
  mutate(runtime=10*round(runtime/10)) %>%
  group_by(runtime) %>%
  summarise(mean_profit=mean(profit)) %>%
  ggplot(aes(x=runtime,y=mean_profit))+geom_point()+scale_y_continuous(limits = c(0, 500000000))+scale_x_continuous(limits = c(10, 350))+xlab("run time (mins)")+ylab("mean profit")

We can see that longer movies do seem to garner more profit. However, budget can be a confounder in this case, because longer movies generally have higher budgets.

Number of ratings

We explored the effect of the number of ratings on a movie's prospect of success, measured here as log(revenue/budget). We looked at each genre separately.

num_rat=data  %>% 
  gather(key = genre, value=check ,  Action:Western) %>%
  filter(check==1) %>%
  select(-check) %>%
  group_by(genre) %>%
  mutate(count=n()) %>%
  ungroup %>%
  filter(count>100)
num_rat %>% 
  ggplot(aes(x=num_rating,y=log(revenue/budget)))+geom_point(aes(color=genre))+geom_smooth(span=0.02,col="blue")+scale_y_continuous(limits = c(-2.5, 2.5))+ facet_wrap(~genre)+xlab(" number of ratings")+ylab("log(profit_ratio)")

The graphs above indicate a clear positive correlation between the prospect of success and the number of ratings for action, adventure, fantasy, history, mystery and romance films. However, the number of ratings only becomes available after a film's release, and therefore will not be used in our models.

Budget

When it comes to a movie's profit, the first thing that might come to mind is that the more a production company invests, the more it will gain in profits. Is it true?

#let's look at our main predictor distribution first
par(mfrow=c(1,2))
hist(data$budget)
hist(data$profit)

#they are obviously skewed, let's take the log transformation of them
data<-data%>%mutate(profit=log(profit+300000000))
min(data$budget)
## [1] 1
median(data$budget)
## [1] 2.8e+07
data<-data%>%mutate(budget=log(budget+100))
hist(data$budget)
hist(data$profit)

# visualize points (we filtered low outliers)
data%>%filter(budget>15)%>%ggplot(aes(x=budget,y=profit))+geom_point()+xlab("log(budget)")+ylab("log(profit)")

# fit linear model with budget and profit
fit_budget<-lm(profit~budget,data = data)
summary(fit_budget)
## 
## Call:
## lm(formula = profit ~ budget, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12722 -0.16212 -0.05964  0.09692  0.89277 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.677252   0.051484  362.78   <2e-16 ***
## budget       0.060210   0.003028   19.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.262 on 2990 degrees of freedom
## Multiple R-squared:  0.1168, Adjusted R-squared:  0.1165 
## F-statistic: 395.5 on 1 and 2990 DF,  p-value: < 2.2e-16
#diagnostic plots, residue plot
par(mfrow=c(2,2))
plot(fit_budget)

par(mfrow=c(1,1))
# Get profit
mean(select(filter(data,budget>median(data$budget,na.rm = TRUE)),profit)>median(data$profit))
## [1] 0.6548257

First we looked at the distributions of both profit and budget. Both are skewed, so we decided to log transform both budget and profit.

After fitting a linear model, we find that log(budget) is a significant predictor of a movie's log(profit), although it explains only about 12% of the variance (R-squared = 0.117). The QQ plot looks approximately normal in the middle range. Hence, big budget films tend to make more profit. In fact, we find that 65% of the movies with a budget above the median budget make more profit than the median profit of all films.

NOTE : After log transformation, our variables budget and profit refer to the log transformed variables respectively.
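Because profit can be negative, a plain log is undefined for loss-making films, which is why a constant is added before taking the log. The sketch below shows that shifted transform and its inverse. The helper names `shift_log` and `unshift_log` are hypothetical, the offset mirrors the 300,000,000 used for profit above, and the first profit value is purely illustrative.

```r
# Hypothetical helpers for a shifted log transform and its inverse.
shift_log   <- function(x, offset) log(x + offset)
unshift_log <- function(z, offset) exp(z) - offset

profit_offset <- 300000000                       # same offset as used for profit above
profit <- c(-50000000, 0, 23489268, 605475499)   # a loss (illustrative), break-even, median, capped maximum

log_profit <- shift_log(profit, profit_offset)   # defined even for the negative entry
recovered  <- unshift_log(log_profit, profit_offset)

print(round(log_profit, 3))
print(all.equal(recovered, profit))
```

One consequence of the shift is that a regression slope on this scale describes relative changes in (profit + offset), not in profit itself, so it should not be read as a simple elasticity.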


Final analysis : Model Building

Important factors to predict a movie’s profit

Linear Regression

First, we try a linear regression with a few predictors we created in the previous sections to get a rough idea of which predictors can be useful for further analysis.

# Prepare data for regression table
# Adding season factor => 0: other, 1: summer
data<-data%>%mutate(season=ifelse((data$m>=4 &data$m<=8),1,0))
# Categories
data<-data%>%mutate(d_score=ifelse(is.na(d_score),0.01,d_score))
data<-data%>%mutate(Action=as.numeric(Action),Adventure=as.numeric(Adventure),Animation=as.numeric(Animation),Comedy=as.numeric(Comedy),Crime=as.numeric(Crime),Drama=as.numeric(Drama),Romance=as.numeric(Romance),Thriller=as.numeric(Thriller),I_superhero=as.numeric(I_superhero),I_saving_world=as.numeric(I_saving_world),first_star_potion=first_star_potion*10)
# First actor's portion of budget = which we predict is around 40% of budget
data<-data%>%mutate(first_star_potion=ifelse(first_star_potion==0,0.1,first_star_potion))
dat_checkpoint2<-data

dat<-data%>%select(TMDBID,profit,a_score_t,first_star_potion,runtime,budget_ratio,year,Action,Adventure,Animation,Comedy,Crime,Drama,Romance,Thriller,USA,s_production,d_score,I_saving_world,I_superhero,season)

dat<-dat%>%select(-c(TMDBID))

# Build regression model
dat<-dat[complete.cases(dat),]
fit=lm(profit~.,data=dat)
summary(fit)
## 
## Call:
## lm(formula = profit ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72787 -0.11207 -0.02003  0.09329  0.97772 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.586e+01  1.200e+00  13.214  < 2e-16 ***
## a_score_t          2.722e-05  6.186e-05   0.440  0.65998    
## first_star_potion  3.492e-03  2.016e-03   1.732  0.08337 .  
## runtime            1.235e-03  2.686e-04   4.597 4.46e-06 ***
## budget_ratio       5.212e-02  4.088e-03  12.749  < 2e-16 ***
## year               1.757e-03  5.969e-04   2.943  0.00327 ** 
## Action            -2.802e-02  1.085e-02  -2.582  0.00988 ** 
## Adventure          2.695e-02  1.225e-02   2.199  0.02796 *  
## Animation          1.394e-01  2.005e-02   6.952 4.41e-12 ***
## Comedy             1.113e-02  1.040e-02   1.071  0.28436    
## Crime             -1.488e-02  1.160e-02  -1.283  0.19951    
## Drama             -4.587e-02  9.896e-03  -4.635 3.72e-06 ***
## Romance            3.180e-02  1.168e-02   2.724  0.00649 ** 
## Thriller          -8.000e-03  1.057e-02  -0.757  0.44929    
## USA                1.918e-02  9.275e-03   2.067  0.03877 *  
## s_production       2.037e-04  4.945e-05   4.120 3.89e-05 ***
## d_score            8.318e-04  5.291e-05  15.721  < 2e-16 ***
## I_saving_world     8.353e-02  4.563e-02   1.831  0.06726 .  
## I_superhero       -4.494e-02  6.222e-02  -0.722  0.47021    
## season             3.180e-02  8.570e-03   3.711  0.00021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.222 on 2949 degrees of freedom
## Multiple R-squared:  0.3727, Adjusted R-squared:  0.3687 
## F-statistic: 92.23 on 19 and 2949 DF,  p-value: < 2.2e-16

Sadly, after adjusting for other variables, neither the saving-the-world theme nor the superhero theme is significantly associated with making more money.

Model selection

Now, we refine our initial model and try model selection using AIC with forward-backward (stepwise) selection to identify key predictors systematically.
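As a minimal illustration of what AIC-guided stepwise selection does, the sketch below runs base R's `step()` (which performs the same kind of search as `MASS::stepAIC`) on the built-in `mtcars` data. The dataset is only a stand-in for the movie table.

```r
# Stepwise selection by AIC on a stand-in dataset (mtcars), not the movie data.
full <- lm(mpg ~ ., data = mtcars)

# Starting from the full model, terms are dropped (or re-added) one at a time
# as long as doing so lowers the AIC.
selected <- step(full, direction = "both", trace = 0)

print(formula(selected))
print(selected$anova)                                   # one row per step taken
print(c(full = AIC(full), selected = AIC(selected)))
```

The `anova` component has the same layout as the step table printed for the movie model: each row is one dropped or added term together with the resulting AIC.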

step <- MASS::stepAIC(fit, direction="both",na.rm=TRUE,trace=FALSE)

step$anova %>% kable
| Step | Df | Deviance | Resid. Df | Resid. Dev | AIC |
|:---|---:|---:|---:|---:|---:|
| | NA | NA | 2949 | 145.2890 | -8918.232 |
| - a_score_t | 1 | 0.0095378 | 2950 | 145.2985 | -8920.037 |
| - Thriller | 1 | 0.0249668 | 2951 | 145.3235 | -8921.527 |
| - I_superhero | 1 | 0.0234143 | 2952 | 145.3469 | -8923.049 |
#Final Model:profit = 
#first_star_potion + runtime + budget_ratio + year + Action + Adventure + Animation + Drama + Romance + Thriller +  USA + s_production + d_score + season

# fitting model with budget ratio
fit_profit=lm(profit ~ first_star_potion + runtime + budget_ratio + year + 
    Action + Adventure + Animation + Drama + Romance + Thriller + 
    USA + s_production + d_score + season,data=data)
summary(fit_profit)
## 
## Call:
## lm(formula = profit ~ first_star_potion + runtime + budget_ratio + 
##     year + Action + Adventure + Animation + Drama + Romance + 
##     Thriller + USA + s_production + d_score + season, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73709 -0.11171 -0.01942  0.09282  0.97566 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.595e+01  1.193e+00  13.368  < 2e-16 ***
## first_star_potion  3.599e-03  1.955e-03   1.841 0.065655 .  
## runtime            1.169e-03  2.611e-04   4.477 7.86e-06 ***
## budget_ratio       5.369e-02  3.686e-03  14.565  < 2e-16 ***
## year               1.717e-03  5.936e-04   2.893 0.003843 ** 
## Action            -3.113e-02  1.067e-02  -2.917 0.003561 ** 
## Adventure          2.826e-02  1.215e-02   2.325 0.020140 *  
## Animation          1.366e-01  2.001e-02   6.828 1.04e-11 ***
## Drama             -4.890e-02  9.540e-03  -5.126 3.15e-07 ***
## Romance            3.548e-02  1.143e-02   3.103 0.001933 ** 
## Thriller          -1.405e-02  9.442e-03  -1.488 0.136925    
## USA                1.986e-02  9.202e-03   2.158 0.030994 *  
## s_production       2.076e-04  4.920e-05   4.220 2.52e-05 ***
## d_score            8.371e-04  5.223e-05  16.028  < 2e-16 ***
## season             3.247e-02  8.544e-03   3.801 0.000147 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.222 on 2954 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.3713, Adjusted R-squared:  0.3683 
## F-statistic: 124.6 on 14 and 2954 DF,  p-value: < 2.2e-16
# sadly, after adjusting for other factors, superhero movies won't let you make more money
augmented <- augment(fit_profit)
augmented%>%ggplot(aes(x=.hat,y=.resid))+geom_point()+scale_x_continuous(limits=c(0,0.03))+geom_hline(yintercept = 0,color='red')+ggtitle("residual plot")

library(car)
#av.plots(fit_profit)
op <- par(no.readonly = TRUE)
par(mfrow=c(2,2))
plot(fit_profit)

par(op)
par(mfrow=c(1,1))

#There are some extreme outliers 2196, 2940, 394
movie<-read.csv("movies_3.csv")

#outlier movies
t<-data%>%slice(c(2196,2940,394,2753))%>%left_join(movie,by="TMDBID")%>%select(title.x,budget.x,profit)%>%mutate(p_vs_b=profit/budget.x)
t%>%kable()
| title.x | budget.x | profit | p_vs_b |
|:---|---:|---:|---:|
| Avatar | 19.28357 | 20.62397 | 1.0695099 |
| Star Wars: The Force Awakens | 19.11383 | 20.62397 | 1.0790078 |
| Titanic | 19.11383 | 20.62397 | 1.0790078 |
| The Lone Ranger | 19.35677 | 18.71551 | 0.9668714 |
outlier<-t

After running model selection, we see that neither the superhero theme nor the saving-the-world theme is significantly associated with profit. This seems counterintuitive.

Budget ratio has a very significant p-value, and the budget by itself explained around 30% of the profit a movie made. Not surprisingly, most superhero and saving-the-world films have big budgets, which can explain why they do so well. We further fitted two reduced models: one including actor score but not budget ratio, and one including budget ratio but not actor score. Since actor scores are derived from the movies' budgets, those variables are collinear and we should include only one of them. We decided to go with budget ratio for further analysis.

It seems that in order to make money in the movie industry, a producer should aim for high-budget films with a good director and a cast of good actors. However, our “first star potion” variable (which accounts for the money paid to the first star) is not significant at the 5% level (p = 0.066), suggesting that it is more important to spread the budget among multiple established actors. Hence a producer does not necessarily have to invest heavily in one lead actor. This finding argues against the rationale of whitewashing, which emphasizes hiring a single established actor who would sell the movie.

We also looked at the outliers in our model, and not surprisingly they include three of the biggest hits of all time (Avatar, Star Wars: The Force Awakens and Titanic) as well as one of the biggest flops of all time (The Lone Ranger).

Overall: profitability is positively associated with higher budget, longer runtime, the Adventure, Animation and Romance genres, a good director and a summer release, and negatively associated with the Action and Drama genres. The Thriller coefficient is also negative but not statistically significant (p = 0.14).

How to make more money with less budget

However, relying on a higher budget seems unfair to small-budget movie producers, and there are certainly many examples of small-budget films that smashed the box office (e.g. Paranormal Activity). Hence, we are interested in learning the features that determine success in each of the budget categories.

We stratified budget into three categories: low budget (at or below the 30th percentile), medium budget (above the 30th and up to the 70th percentile) and high budget (above the 70th percentile). Let's find out what happens within and across these strata, evaluating a movie's ability to make money as the ratio of its profit to its budget.

NOTE : We use non log transformed budget and profit variables for all further analysis (except for decision tree at the end)

data<-data_checkpoint1
# Getting profit over budget ratio
data<-data%>%mutate(p_vs_b=profit/budget)
hist(data$p_vs_b)

# Outliers => biggest profit to budget ratio films
t<-data%>%filter(p_vs_b>50)%>%select(title,budget,profit,p_vs_b)
t%>%kable()
| title | budget | profit | p_vs_b |
|:---|---:|---:|---:|
| Clerks | 27000 | 3124130 | 115.70852 |
| The Full Monty | 3500000 | 254350122 | 72.67146 |
| Pi | 60000 | 3161152 | 52.68587 |
| Lost & Found | 1 | 99 | 99.00000 |
| The Blair Witch Project | 25000 | 247975000 | 9919.00000 |
| My Big Fat Greek Wedding | 5000000 | 363744044 | 72.74881 |
| Napoleon Dynamite | 400000 | 45718097 | 114.29524 |
| Super Size Me | 65000 | 28510078 | 438.61658 |
| Primer | 7000 | 417760 | 59.68000 |
| Saw | 1200000 | 102711669 | 85.59306 |
| Open Water | 130000 | 54537954 | 419.52272 |
| Facing the Giants | 100000 | 10078331 | 100.78331 |
| Once | 160000 | 20550513 | 128.44071 |
| Paranormal Activity | 15000 | 193340800 | 12889.38667 |
| Catfish | 30000 | 3015943 | 100.53143 |
| Paranormal Activity 2 | 3000000 | 174512032 | 58.17068 |
| From Prada to Nada | 93 | 2499907 | 26880.72043 |
| Insidious | 1500000 | 95509150 | 63.67277 |
| The Devil Inside | 1000000 | 100758490 | 100.75849 |
| A Little Chaos | 80000 | 10004623 | 125.05779 |
names(outlier) <-c("title", "budget", "profit", "p_vs_b")
outlier<-rbind(outlier,t)
data<-data%>%filter(p_vs_b<50) # filtering out outliers from our analysis
hist(data$p_vs_b,breaks = 5000)

x <- quantile(data$budget,0.3)
y <- quantile(data$budget,0.7)
#1 as low, 2 as median, 3 as high
data<-data%>%mutate(c_budget=ifelse(budget<=x,1,ifelse(budget>y,3,2)))

# scatter plots of budget vs profit within each budget stratum
data%>%ggplot(aes(x=budget,y=profit))+geom_point()+facet_wrap(~c_budget)+ggtitle("Budget vs profit in each budget stratum - fixed scale")

data%>%ggplot(aes(x=budget,y=profit))+geom_point()+facet_wrap(~c_budget,scales = "free")+ggtitle("Budget vs profit in each budget stratum - free scale")

data%>%ggplot(aes(profit))+geom_histogram(bins = 30)+facet_grid(c_budget~.,scales = "free")+ggtitle("Profit of 3 budget strata")

# Model fitting in each of the strata
library(broom)
# tidy each stratum's fit directly (the two-argument tidy(fits,mod) form is deprecated in broom)
t<-data%>%group_by(c_budget)%>%
  do(tidy(lm(profit~budget,data=.)))
t<-as.data.frame(t)
t%>%filter(term=='budget')
##   c_budget   term  estimate std.error statistic      p.value
## 1        1 budget 2.6911162 0.3581740  7.513433 1.378338e-13
## 2        2 budget 0.9870594 0.2497722  3.951839 8.200565e-05
## 3        3 budget 2.1056388 0.1266834 16.621271 7.658950e-54
data%>%group_by(c_budget)%>%summarize(median_profit=median(profit))%>%ggplot(aes(x=c_budget,y=median_profit))+geom_bar(stat ="identity" )+xlab("Budget category")

data%>%ggplot(aes(x=as.factor(c_budget),y=profit))+geom_boxplot()

#p_vs_b ratio
data%>%group_by(c_budget)%>%summarize(median_profit_vs_budget=median(profit/budget))%>%ggplot(aes(x=c_budget,y=median_profit_vs_budget))+geom_bar(stat ="identity" )+ xlab("Budget category")+ylab("profit vs budget ratio")

data%>%mutate(profit_vs_budget=profit/budget)%>%filter(profit_vs_budget<100)%>%ggplot(aes(x=as.factor(c_budget),y=profit_vs_budget))+geom_boxplot()+xlab("Budget category") + ylab("profit vs budget ratio")

After stratifying our movies into budget categories, we see that higher-budget movies are more profitable, which is not surprising. We then evaluated the significance of budget within each category and found that budget still has a statistically significant influence on profit.

We then “normalized” profit by budget, comparing the median profit-to-budget ratio of each category. Interestingly, the large differences in profit disappear. Hence, we can conclude that the return of profits is multiplicative with respect to budget.
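The "multiplicative returns" reading can be checked on simulated data. In the sketch below, which is an illustration under stated assumptions and not a reanalysis, each film's return per dollar (`return_ratio`, a hypothetical variable) is drawn independently of its budget. The stratum cut-offs follow the same 30th and 70th percentile rule as above.

```r
# Toy data: profit is budget times a return ratio that does not depend on budget.
set.seed(42)
n <- 3000
budget <- exp(rnorm(n, mean = 17, sd = 1.2))          # hypothetical budgets in dollars
return_ratio <- exp(rnorm(n, mean = 0, sd = 0.8)) - 0.5
profit <- return_ratio * budget

# Same stratification rule as in the analysis: 1 = low, 2 = medium, 3 = high.
cuts <- unname(quantile(budget, c(0.3, 0.7)))
c_budget <- ifelse(budget <= cuts[1], 1, ifelse(budget > cuts[2], 3, 2))

median_profit <- tapply(profit, c_budget, median)
median_ratio  <- tapply(profit / budget, c_budget, median)

print(round(median_profit))     # grows with the budget stratum
print(round(median_ratio, 2))   # roughly the same in every stratum
```

If returns are multiplicative, median profit rises steeply across strata while the median profit-to-budget ratio stays roughly flat, which is the pattern described above.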

Feature selection in each budget stratum

Now let’s look at how a movie does within each budget stratum and its relationship with other factors.

#require(dplyr)
#data<-as.data.frame(data)
data<-data%>%mutate(season=ifelse((month(data$date)>=4 &month(data$date)<=8),1,0))
data_checkpoint3<-data
dat_f<-data%>%select(TMDBID,p_vs_b,a_score_t,first_star_potion,runtime,year,Action,Adventure,Animation,Comedy,Crime,Drama,Romance,Thriller,USA,s_production,d_score,I_saving_world,I_superhero,c_budget,season)


dat_low<-dat_f%>%filter(c_budget==1)
# Number of films with low budget
nrow(dat_low)
## [1] 910
dat_median<-dat_f%>%filter(c_budget==2)
# Number of films with medium budget
nrow(dat_median)
## [1] 1221
dat_high<-dat_f%>%filter(c_budget==3)
# Number of films with high budget
nrow(dat_high)
## [1] 841

High budget film analysis

dat<-dat_high%>%select(-c(TMDBID,c_budget))
dat<-dat[complete.cases(dat),]

X<-data.matrix(dat)

library(corrplot)
cor = cor(X)
corrplot(cor,method = 'circle')

cor(X)[1,]
##            p_vs_b         a_score_t first_star_potion           runtime 
##        1.00000000        0.09550795       -0.06446110        0.12917623 
##              year            Action         Adventure         Animation 
##        0.09058907       -0.04074702        0.12754723        0.15664117 
##            Comedy             Crime             Drama           Romance 
##        0.02593791       -0.07855408       -0.08760020        0.03808010 
##          Thriller               USA      s_production           d_score 
##       -0.10528462        0.06124294        0.05364927        0.36248336 
##    I_saving_world       I_superhero            season 
##        0.05072048       -0.03287486        0.09433503
hist(data$p_vs_b)

hist(log10(data$p_vs_b+1.1))

data<-data%>%mutate(p_vs_b=log10(p_vs_b+1.1))
data<-data%>%filter(!is.na(p_vs_b))

# fit and model selection
fit=lm(p_vs_b~.,data=dat)
step <- MASS::stepAIC(fit, direction="both",na.rm=TRUE,trace=FALSE)
step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## p_vs_b ~ a_score_t + first_star_potion + runtime + year + Action + 
##     Adventure + Animation + Comedy + Crime + Drama + Romance + 
##     Thriller + USA + s_production + d_score + I_saving_world + 
##     I_superhero + season
## 
## Final Model:
## p_vs_b ~ runtime + year + Animation + Drama + Romance + USA + 
##     d_score + season
## 
## 
##                   Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                                         815   2184.604 841.1059
## 2              - Crime  1 0.2205296       816   2184.825 839.1901
## 3          - a_score_t  1 0.3978720       817   2185.223 837.3419
## 4        - I_superhero  1 0.6787685       818   2185.902 835.6009
## 5     - I_saving_world  1 0.7526602       819   2186.654 833.8880
## 6           - Thriller  1 1.1163148       820   2187.771 832.3137
## 7             - Comedy  1 1.1112199       821   2188.882 830.7372
## 8  - first_star_potion  1 1.9135383       822   2190.795 829.4660
## 9          - Adventure  1 2.3896600       823   2193.185 828.3752
## 10            - Action  1 1.9421978       824   2195.127 827.1134
## 11      - s_production  1 5.0903151       825   2200.218 827.0452
#final model:step$anova p_vs_b ~ runtime + year + Adventure + Animation + Drama + Romance + USA + s_production + d_score + season

fit_high=lm(p_vs_b ~ runtime + year + Adventure + Animation + Drama + Romance + 
    USA + s_production + d_score + season,data=dat)
summary(fit_high)
## 
## Call:
## lm(formula = p_vs_b ~ runtime + year + Adventure + Animation + 
##     Drama + Romance + USA + s_production + d_score + season, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8588 -1.1232 -0.2602  0.8237  8.8920 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.166e+01  1.942e+01  -1.630   0.1034    
## runtime        7.908e-03  3.530e-03   2.240   0.0253 *  
## year           1.549e-02  9.672e-03   1.602   0.1095    
## AdventureTRUE  9.324e-02  1.259e-01   0.741   0.4591    
## AnimationTRUE  1.114e+00  1.894e-01   5.880 5.97e-09 ***
## DramaTRUE     -2.978e-01  1.417e-01  -2.102   0.0359 *  
## RomanceTRUE    7.480e-01  1.843e-01   4.059 5.40e-05 ***
## USA            2.298e-01  1.304e-01   1.762   0.0784 .  
## s_production   9.434e-04  6.679e-04   1.413   0.1582    
## d_score        5.399e-03  5.667e-04   9.527  < 2e-16 ***
## season         2.211e-01  1.178e-01   1.878   0.0608 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.633 on 823 degrees of freedom
## Multiple R-squared:  0.2035, Adjusted R-squared:  0.1938 
## F-statistic: 21.03 on 10 and 823 DF,  p-value: < 2.2e-16
# fit quality checking
op <- par(no.readonly = TRUE)
par(mfrow=c(2,2))
plot(fit_high)

par(op)
#outlier movies
t<-dat_high%>%slice(c(21,829,25))%>%left_join(movie,by="TMDBID")%>%mutate(profit=revenue-budget)%>%select(title,budget,profit,p_vs_b)
t%>%kable
| title | budget | profit | p_vs_b |
|:---|---:|---:|---:|
| Stargate | 5.5e+07 | 141567262 | 2.5739502 |
| San Andreas | 1.1e+08 | 360490832 | 3.2771894 |
| Wyatt Earp | 6.3e+07 | -37948000 | -0.6023492 |
outlier<-rbind(outlier,t)

require(bootstrap)
## Loading required package: bootstrap
## 
## Attaching package: 'bootstrap'
## The following object is masked from 'package:broom':
## 
##     bootstrap
# define functions 
theta.fit <- function(x,y){lsfit(x,y)}
# predict from the fit that crossval passes in for the current training folds
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef} 

# matrix of predictors
X <- as.matrix(dat[c(-1)])
# vector of predicted values
y <- as.matrix(dat[c("p_vs_b")]) 

# measurement of model fitness
results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)
cor(y, fit$fitted.values)**2 # raw R2 
##             [,1]
## p_vs_b 0.2068012
cor(y,results$cv.fit)**2 # cross-validated R2
##             [,1]
## p_vs_b 0.2068012
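For reference, k-fold cross-validated R-squared can also be computed with a plain loop, in which every held-out fold is predicted by a model fitted on the remaining folds only. The sketch below uses base R and the built-in `mtcars` data as a stand-in. The column choice and the name `dat_cv` are arbitrary.

```r
# 10-fold cross-validated R^2 with base R, on a stand-in dataset.
set.seed(123)
k <- 10
dat_cv <- mtcars[, c("mpg", "wt", "hp", "qsec")]
fold <- sample(rep(seq_len(k), length.out = nrow(dat_cv)))   # random fold labels

cv_pred <- numeric(nrow(dat_cv))
for (i in seq_len(k)) {
  held_out <- fold == i
  fold_fit <- lm(mpg ~ ., data = dat_cv[!held_out, ])        # fit without fold i
  cv_pred[held_out] <- predict(fold_fit, newdata = dat_cv[held_out, ])
}

full_fit <- lm(mpg ~ ., data = dat_cv)
raw_r2 <- cor(dat_cv$mpg, fitted(full_fit))^2   # in-sample R^2
cv_r2  <- cor(dat_cv$mpg, cv_pred)^2            # out-of-fold R^2
print(c(raw = raw_r2, cross_validated = cv_r2))
```

Because the out-of-fold predictions come from models that never saw the held-out rows, the cross-validated value generally differs from the in-sample one, and the gap between them is a rough gauge of overfitting.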

From the correlation plot, it seems that the choice of director has the strongest association with profitability (a correlation of 0.36 between d_score and p_vs_b).

According to our model, runtime, director score and the Animation and Romance genres have a statistically significant positive association with profitability, while the Drama genre has a significant negative one. The production company score is not significant in this stratum (p = 0.16).

Medium budget film analysis

dat<-dat_median%>%select(-c(TMDBID,c_budget))
dat<-dat[complete.cases(dat),]

X<-data.matrix(dat)

require(corrplot)
cor = cor(X)
corrplot(cor,method = 'circle')

cor(X)[1,]
##            p_vs_b         a_score_t first_star_potion           runtime 
##        1.00000000        0.02373202        0.04488189        0.04488398 
##              year            Action         Adventure         Animation 
##       -0.14118207       -0.08294999        0.03219401        0.08686071 
##            Comedy             Crime             Drama           Romance 
##        0.12146113       -0.05592638       -0.07945249        0.07195591 
##          Thriller               USA      s_production           d_score 
##       -0.12898696        0.05888788        0.13732905        0.18821379 
##    I_saving_world       I_superhero            season 
##        0.03194612        0.02709791        0.09135648
# fit and model selection
fit=lm(p_vs_b~.,data=dat)
step <- MASS::stepAIC(fit, direction="both",na.rm=TRUE,trace=FALSE)
step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## p_vs_b ~ a_score_t + first_star_potion + runtime + year + Action + 
##     Adventure + Animation + Comedy + Crime + Drama + Romance + 
##     Thriller + USA + s_production + d_score + I_saving_world + 
##     I_superhero + season
## 
## Final Model:
## p_vs_b ~ first_star_potion + runtime + year + Action + Animation + 
##     Drama + Romance + Thriller + s_production + d_score + season
## 
## 
##               Step Df    Deviance Resid. Df Resid. Dev      AIC
## 1                                      1132   9834.637 2507.217
## 2 - I_saving_world  1  0.02908233      1133   9834.666 2505.220
## 3          - Crime  1  0.51472966      1134   9835.181 2503.280
## 4      - a_score_t  1  1.99479509      1135   9837.175 2501.514
## 5      - Adventure  1  4.79694242      1136   9841.972 2500.075
## 6         - Comedy  1  8.56968329      1137   9850.542 2499.077
## 7    - I_superhero  1 12.57861376      1138   9863.121 2498.545
## 8            - USA  1 13.54635075      1139   9876.667 2498.125
#final model:step$anova p_vs_b ~ first_star_potion + runtime + year + Action + Animation + Drama + Romance + Thriller + s_production + d_score + season

fit_median=lm(p_vs_b ~ first_star_potion + runtime + year + Action + Animation + Drama + Romance + Thriller + s_production + d_score + season,data=dat)
summary(fit_median)
## 
## Call:
## lm(formula = p_vs_b ~ first_star_potion + runtime + year + Action + 
##     Animation + Drama + Romance + Thriller + s_production + d_score + 
##     season, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7611 -1.7303 -0.6435  0.8509 21.2117 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       78.014026  25.823029   3.021  0.00257 ** 
## first_star_potion  0.709789   0.443996   1.599  0.11018    
## runtime            0.010066   0.005558   1.811  0.07036 .  
## year              -0.038860   0.012826  -3.030  0.00250 ** 
## ActionTRUE        -0.518156   0.215282  -2.407  0.01625 *  
## AnimationTRUE      1.446709   0.531008   2.724  0.00654 ** 
## DramaTRUE         -0.649737   0.199369  -3.259  0.00115 ** 
## RomanceTRUE        0.392443   0.239355   1.640  0.10137    
## ThrillerTRUE      -0.539802   0.199554  -2.705  0.00693 ** 
## s_production       0.003303   0.001013   3.259  0.00115 ** 
## d_score            0.007229   0.001187   6.091 1.54e-09 ***
## season             0.507962   0.184145   2.758  0.00590 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.945 on 1139 degrees of freedom
## Multiple R-squared:  0.106,  Adjusted R-squared:  0.09732 
## F-statistic: 12.27 on 11 and 1139 DF,  p-value: < 2.2e-16
# fit quality
op <- par(no.readonly = TRUE)
par(mfrow=c(2,2))
plot(fit_median)

par(op)

# outlier movies
t<-dat_high%>%slice(c(489,70,69))%>%left_join(movie,by="TMDBID")%>%mutate(profit=revenue-budget)%>%select(title,budget,profit,p_vs_b)
t%>%kable
| title | budget | profit | p_vs_b |
|:---|---:|---:|---:|
| American Gangster | 1.0e+08 | 166465037 | 1.6646504 |
| The Devil’s Advocate | 5.7e+07 | 3984028 | 0.0698952 |
| Seven Years in Tibet | 7.0e+07 | 61457682 | 0.8779669 |
names(outlier) <- c("title", "budget", "profit", "p_vs_b")
names(t) <- c("title", "budget", "profit", "p_vs_b")
outlier<-rbind(outlier,t)

require(bootstrap)
# define functions 
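theta.fit <- function(x,y){lsfit(x,y)}
# NOTE: the body reads the global `fit`, not the fold model passed in as the
# first argument, so every fold is predicted from the full-data fit
theta.predict <- function(fit_median,x){cbind(1,x)%*%fit$coef} 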

# matrix of predictors
X <- as.matrix(dat[c(-1)])
# response vector (profit to budget ratio)
y <- as.matrix(dat[c("p_vs_b")]) 

results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)
cor(y, fit$fitted.values)**2 # raw R2 
##             [,1]
## p_vs_b 0.1097596
cor(y,results$cv.fit)**2 # cross-validated R2
##             [,1]
## p_vs_b 0.1097596
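
The raw and cross-validated \(R^2\) above are identical, which is what one would expect if every fold were scored with the same full-data fit. For comparison, here is a minimal sketch of 10-fold cross-validation in which each fold is predicted by a model refit on the other nine folds. It uses simulated data and base R only; `sim`, `kfold_cv_predict` and `cv_r2` are hypothetical names and are not part of the analysis above.

```r
# Minimal 10-fold cross-validation sketch on simulated data (base R only).
# All names here (sim, kfold_cv_predict, cv_r2) are illustrative assumptions.
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n)

kfold_cv_predict <- function(formula, data, k = 10) {
  # Randomly assign each row to one of k folds of near-equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  pred <- numeric(nrow(data))
  for (i in seq_len(k)) {
    held_out <- fold == i
    # Refit on the other k - 1 folds, then predict the held-out rows
    fit_i <- lm(formula, data = data[!held_out, ])
    pred[held_out] <- predict(fit_i, newdata = data[held_out, ])
  }
  pred
}

cv_pred <- kfold_cv_predict(y ~ x1 + x2, sim, k = 10)
raw_r2 <- summary(lm(y ~ x1 + x2, data = sim))$r.squared
cv_r2 <- cor(sim$y, cv_pred)^2  # squared correlation, as in the code above
c(raw = raw_r2, cross_validated = cv_r2)
```

When the model is refit per fold, the cross-validated value typically comes out somewhat below the raw one, and the gap gives a rough sense of how much the in-sample \(R^2\) overstates predictive power.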

From the correlation plot, it is difficult to pinpoint significant correlation between profitability and any of the other factors.

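According to our model, director score, production company score, release year, summer release and the Animation, Action, Drama and Thriller genres exert a statistically significant influence on profitability at the 5% level (Animation positively; Action, Drama and Thriller negatively). Hence medium budget films should aim for a summer release.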

The predictive power of our model (given by \(R^2\)) is about half that of our high budget film model.

Low budget film analysis

dat<-dat_low%>%select(-c(TMDBID,c_budget))
dat<-dat[complete.cases(dat),]

X<-data.matrix(dat)

require(corrplot)
cor = cor(X)
corrplot(cor,method = 'circle')

cor(X)[1,]
##            p_vs_b         a_score_t first_star_potion           runtime 
##       1.000000000      -0.034764806       0.001972577       0.082227085 
##              year            Action         Adventure         Animation 
##      -0.035047130      -0.102629054      -0.048820545      -0.008611456 
##            Comedy             Crime             Drama           Romance 
##      -0.010136756      -0.087503414      -0.025248600       0.044598084 
##          Thriller               USA      s_production           d_score 
##      -0.036228118       0.017953620       0.066465968       0.169278951 
##    I_saving_world       I_superhero            season 
##                NA      -0.027240183      -0.012076510
# fit and model selection
fit=lm(p_vs_b~.,data=dat)
step <- MASS::stepAIC(fit, direction="both",na.rm=TRUE,trace=FALSE)
step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## p_vs_b ~ a_score_t + first_star_potion + runtime + year + Action + 
##     Adventure + Animation + Comedy + Crime + Drama + Romance + 
##     Thriller + USA + s_production + d_score + I_saving_world + 
##     I_superhero + season
## 
## Final Model:
## p_vs_b ~ a_score_t + runtime + Action + Crime + Drama + s_production + 
##     d_score
## 
## 
##                   Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                                         788   31608.36 2993.289
## 2     - I_saving_world  0  0.000000       788   31608.36 2993.289
## 3        - I_superhero  1  1.296611       789   31609.65 2991.322
## 4                - USA  1  3.995183       790   31613.65 2989.424
## 5          - Animation  1  3.822134       791   31617.47 2987.521
## 6               - year  1  4.255536       792   31621.73 2985.630
## 7           - Thriller  1  5.007302       793   31626.73 2983.757
## 8             - season  1  6.607694       794   31633.34 2981.926
## 9  - first_star_potion  1 12.669568       795   31646.01 2980.249
## 10            - Comedy  1 13.971486       796   31659.98 2978.604
## 11         - Adventure  1 28.887895       797   31688.87 2977.339
## 12           - Romance  1 42.913246       798   31731.78 2976.430
#final model:step$anova p_vs_b ~ a_score_t + runtime + Action + Crime + Drama + s_production + d_score

fit_low=lm(p_vs_b ~ a_score_t + runtime + Action + Crime + Drama + s_production + 
    d_score ,data=dat)
summary(fit_low)
## 
## Call:
## lm(formula = p_vs_b ~ a_score_t + runtime + Action + Crime + 
##     Drama + s_production + d_score, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.332  -3.353  -1.799   0.919  43.187 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.361700   1.509776  -0.240  0.81072    
## a_score_t    -0.006679   0.003553  -1.880  0.06051 .  
## runtime       0.040056   0.014638   2.737  0.00635 ** 
## ActionTRUE   -1.671751   0.648369  -2.578  0.01010 *  
## CrimeTRUE    -1.339088   0.578186  -2.316  0.02081 *  
## DramaTRUE    -0.723269   0.497297  -1.454  0.14623    
## s_production  0.004354   0.002842   1.532  0.12593    
## d_score       0.019349   0.003868   5.003 6.95e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.306 on 798 degrees of freedom
## Multiple R-squared:  0.06199,    Adjusted R-squared:  0.05376 
## F-statistic: 7.534 on 7 and 798 DF,  p-value: 8.29e-09
# Fit quality assessment
op <- par(no.readonly = TRUE)
par(mfrow=c(2,2))
plot(fit_low)

par(op)


require(bootstrap)
# define functions 
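theta.fit <- function(x,y){lsfit(x,y)}
# NOTE: the body reads the global `fit`, not the fold model passed in as the
# first argument, so every fold is predicted from the full-data fit
theta.predict <- function(fit_low,x){cbind(1,x)%*%fit$coef} 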

# matrix of predictors
X <- as.matrix(dat[c(-1)])
# response vector (profit to budget ratio)
y <- as.matrix(dat[c("p_vs_b")]) 

results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)
cor(y, fit$fitted.values)**2 # raw R2 
##              [,1]
## p_vs_b 0.06563884

Only runtime and director score are significant at the 1% level for the profitability of a low budget film; the Action and Crime genres are significant (and negative) at the 5% level. Interestingly, the predictive power of our model (given by \(R^2\), about 0.06) is only a little over half that of our medium budget film model (about 0.11) and under a third that of our high budget film model (about 0.20).
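
To make the \(R^2\) comparison concrete, the adjusted \(R^2\) printed by `summary()` can be recomputed from \(R^2\), the number of observations and the number of predictors. The sketch below uses the standard formula; `adj_r2` is a hypothetical helper, and the sample sizes are inferred from the residual degrees of freedom in the summaries above (residual df plus the number of estimated coefficients).

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors.
# adj_r2 is an illustrative helper, not part of the original analysis.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Inputs read off the summaries above; n = residual df + p + 1
low    <- adj_r2(r2 = 0.06199, n = 798 + 7 + 1,   p = 7)
medium <- adj_r2(r2 = 0.106,   n = 1139 + 11 + 1, p = 11)
round(c(low = low, medium = medium), 4)
```

Both values should land close to the adjusted figures in the summaries; small differences in the last digits come from using the rounded \(R^2\) as input.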

Model Comparison

model_table<-data.frame(model="profit with all movies", Rsq=0.37,number_of_predictors=12)

model_table<-bind_rows( model_table,data.frame(model="profit/budget with high budget", Rsq=0.20,number_of_predictors=7))
           
model_table<-bind_rows( model_table,data.frame(model="profit/budget with median budget", Rsq=0.1,number_of_predictors=8))

model_table<-bind_rows( model_table,data.frame(model="profit/budget with low budget exclude outlier", Rsq=0.065,number_of_predictors=4))

# R squared and number of significant predictors of our models
model_table%>%kable()
| model | Rsq | number_of_predictors |
|:---|---:|---:|
| profit with all movies | 0.370 | 12 |
| profit/budget with high budget | 0.200 | 7 |
| profit/budget with median budget | 0.100 | 8 |
| profit/budget with low budget exclude outlier | 0.065 | 4 |
require(broom)
p_va=tidy(fit_profit) %>% mutate(term=gsub(".*Action.*", "ActionTRUE", term)) %>%
  mutate(term=gsub(".*Adventure.*", "AdventureTRUE", term)) %>%
  mutate(term=gsub(".*Animation.*", "AnimationTRUE", term)) %>%
  mutate(term=gsub(".*Drama.*", "DramaTRUE", term)) %>%
  mutate(term=gsub(".*Romance.*", "RomanceTRUE", term)) %>%
  mutate(term=gsub(".*Thriller.*", "ThrillerTRUE", term)) 

p_va=p_va %>% mutate(p.value=ifelse(p.value<=0.05,estimate,NA))%>% select(term,p.value)
names(p_va)=c("term","All")
l_va=tidy(fit_low) %>% mutate(p.value=ifelse(p.value<=0.05,estimate,NA))%>% select(term,p.value)
names(l_va)=c("term","Low")
m_va=tidy(fit_median) %>% mutate(p.value=ifelse(p.value<=0.05,estimate,NA))%>% select(term,p.value)
names(m_va)=c("term","Median")
h_va=tidy(fit_high) %>% mutate(p.value=ifelse(p.value<=0.05,estimate,NA))%>% select(term,p.value)
names(h_va)=c("term","High")

table=full_join(p_va,l_va,by="term")
table=full_join(table,m_va,by="term")
table=full_join(table,h_va,by="term")


table= table %>% mutate(term=gsub(".*ActionTRUE.*", "Action", term)) %>%
  mutate(term=gsub(".*AdventureTRUE.*", "Adventure", term)) %>%
  mutate(term=gsub(".*AnimationTRUE.*", "Animation", term)) %>%
  mutate(term=gsub(".*DramaTRUE.*", "Drama", term)) %>%
  mutate(term=gsub(".*RomanceTRUE.*", "Romance", term)) %>%
  mutate(term=gsub(".*ThrillerTRUE.*", "Thriller", term)) %>%
  mutate(term=gsub(".*CrimeTRUE.*", "Crime", term)) 
table[is.na(table)] <-0


table %>% kable
| term | All | Low | Median | High |
|:---|---:|---:|---:|---:|
| (Intercept) | 15.9504490 | 0.0000000 | 78.0140259 | 0.0000000 |
| first_star_potion | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| runtime | 0.0011689 | 0.0400559 | 0.0000000 | 0.0079081 |
| budget_ratio | 0.0536874 | 0.0000000 | 0.0000000 | 0.0000000 |
| year | 0.0017174 | 0.0000000 | -0.0388603 | 0.0000000 |
| Action | -0.0311331 | -1.6717507 | -0.5181556 | 0.0000000 |
| Adventure | 0.0282550 | 0.0000000 | 0.0000000 | 0.0000000 |
| Animation | 0.1366325 | 0.0000000 | 1.4467090 | 1.1137056 |
| Drama | -0.0489014 | 0.0000000 | -0.6497370 | -0.2977741 |
| Romance | 0.0354763 | 0.0000000 | 0.0000000 | 0.7479613 |
| Thriller | 0.0000000 | 0.0000000 | -0.5398020 | 0.0000000 |
| USA | 0.0198593 | 0.0000000 | 0.0000000 | 0.0000000 |
| s_production | 0.0002076 | 0.0000000 | 0.0033027 | 0.0000000 |
| d_score | 0.0008371 | 0.0193486 | 0.0072293 | 0.0053990 |
| season | 0.0324749 | 0.0000000 | 0.5079618 | 0.0000000 |
| a_score_t | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Crime | 0.0000000 | -1.3390876 | 0.0000000 | 0.0000000 |
table = table %>% filter(term!="(Intercept)")


table = gather(table,key=budget,value=coefficient,All:High) 
table_g=table %>% filter(term %in% c("Action","Adventure","Animation","Drama","Romance","Thriller","Crime")) %>% filter(budget!="All")



table_g %>% ggplot(aes(x=term,y=coefficient))+geom_bar(stat="identity",aes(fill=budget))+facet_wrap(~budget)+theme(axis.text.x  = element_text(angle=90, vjust=0.5))

table_other = table %>% filter(!(term %in% c("Action","Adventure","Animation","Drama","Romance","Thriller","Crime"))) %>% filter(budget!="All")


table_other %>% ggplot(aes(x=term,y=coefficient))+geom_bar(stat="identity",aes(fill=budget))+facet_wrap(~budget)+theme(axis.text.x  = element_text(angle=90, vjust=0.5))

table_all= table %>% filter(budget=="All")

table_all%>% ggplot(aes(x=term,y=coefficient))+geom_bar(stat="identity",aes(fill=budget))+facet_wrap(~budget)+theme(axis.text.x  = element_text(angle=90, vjust=0.5))

The first two bar plots give a pictorial summary of the important genres (first plot) and the other predictors (second plot) that affect the profitability of films in the 3 budget categories. Each budget category clearly has a different set of significant predictors of a movie’s profitability. From the second plot, a summer release is significant only for medium budget films. The third bar plot shows the influence of each factor on the profitability of films in the non-stratified model, where animation has the largest coefficient.
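
The comparison table above keeps a coefficient only when its p-value is at most 0.05 and reports zero otherwise. A base R sketch of that masking step on simulated data follows. The original code uses `broom::tidy`; this version reads the same quantities from `coef(summary(fit))`. The data and the helper name `significant_coefs` are assumptions made for illustration.

```r
# Keep an estimate only if its p-value is <= alpha; otherwise report 0.
# Simulated data; significant_coefs is an illustrative helper.
set.seed(2)
n <- 200
sim <- data.frame(signal = rnorm(n), noise = rnorm(n))
sim$y <- 2 * sim$signal + rnorm(n)  # `noise` has no true effect on y

significant_coefs <- function(fit, alpha = 0.05) {
  tab <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
  est <- tab[, "Estimate"]
  est[tab[, "Pr(>|t|)"] > alpha] <- 0
  est
}

fit_sim <- lm(y ~ signal + noise, data = sim)
masked <- significant_coefs(fit_sim)
masked
```

A zero in such a table means "not significant at the chosen level", not "estimated effect of zero", which is worth keeping in mind when reading the bar plots.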

Logistic Regression

As an alternative model building tool, we also recast a movie’s success as a binary variable (success = profit to budget ratio above the median) and fit a logistic regression.

set.seed(1)
dat_pred = dat %>%
  mutate(p_vs_b=ifelse(p_vs_b>median(dat$p_vs_b),1,0))
inTrain <- createDataPartition(y = dat_pred$p_vs_b,p=0.90)
train_set <- slice(dat_pred, inTrain$Resample1)
test_set <- slice(dat_pred, -inTrain$Resample1)


full <- glm( p_vs_b~a_score_t+first_star_potion+runtime+year+Action+Adventure+Animation+Comedy+Crime+Drama+Romance+Thriller+USA+s_production+d_score+I_saving_world+I_superhero+season , data=train_set, family = "binomial")
summary(full)
## 
## Call:
## glm(formula = p_vs_b ~ a_score_t + first_star_potion + runtime + 
##     year + Action + Adventure + Animation + Comedy + Crime + 
##     Drama + Romance + Thriller + USA + s_production + d_score + 
##     I_saving_world + I_superhero + season, family = "binomial", 
##     data = train_set)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.68615  -1.08790   0.08283   1.10538   1.81898  
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         11.054921  22.962842   0.481 0.630213    
## a_score_t           -0.003352   0.001303  -2.573 0.010092 *  
## first_star_potion    0.605083   0.320268   1.889 0.058851 .  
## runtime              0.006499   0.005335   1.218 0.223094    
## year                -0.005766   0.011408  -0.505 0.613234    
## ActionTRUE          -0.178580   0.248176  -0.720 0.471789    
## AdventureTRUE       -0.320929   0.382015  -0.840 0.400854    
## AnimationTRUE       -0.193800   0.734124  -0.264 0.791789    
## ComedyTRUE          -0.501539   0.193093  -2.597 0.009393 ** 
## CrimeTRUE           -0.264926   0.213954  -1.238 0.215627    
## DramaTRUE           -0.456942   0.184569  -2.476 0.013297 *  
## RomanceTRUE         -0.009862   0.198398  -0.050 0.960353    
## ThrillerTRUE        -0.383240   0.202656  -1.891 0.058613 .  
## USA                  0.208057   0.175277   1.187 0.235222    
## s_production         0.003294   0.001103   2.987 0.002819 ** 
## d_score              0.006697   0.001776   3.772 0.000162 ***
## I_saving_worldTRUE         NA         NA      NA       NA    
## I_superheroTRUE    -13.968797 602.037520  -0.023 0.981489    
## season               0.063778   0.164404   0.388 0.698065    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1006.45  on 725  degrees of freedom
## Residual deviance:  941.16  on 708  degrees of freedom
## AIC: 977.16
## 
## Number of Fisher Scoring iterations: 13
f_hat1 = predict(full, test_set, type = "response")
pred1=data.frame(test_set,f_hat1) %>%
  mutate(pred=round(f_hat1)) %>% 
  mutate(accurate=ifelse(pred==p_vs_b,1,0)) %>%
  filter(!is.na(pred))

nothing <- glm(p_vs_b ~ 1, data=train_set ,family=binomial)
summary(nothing)
## 
## Call:
## glm(formula = p_vs_b ~ 1, family = binomial, data = train_set)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.177  -1.177   0.000   1.177   1.177  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.00000    0.07423       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1006.4  on 725  degrees of freedom
## Residual deviance: 1006.4  on 725  degrees of freedom
## AIC: 1008.4
## 
## Number of Fisher Scoring iterations: 2
bothways =step(nothing, list(lower=formula(nothing),upper=formula(full)),direction="both",trace=0)
formula(bothways)
## p_vs_b ~ d_score + s_production + a_score_t + first_star_potion + 
##     Comedy + Drama + Thriller + I_superhero
final=glm(    p_vs_b ~ s_production + d_score + a_score_t + first_star_potion + Action + Drama + Comedy  + I_superhero + season, data=train_set, family = "binomial")

summary(final)
## 
## Call:
## glm(formula = p_vs_b ~ s_production + d_score + a_score_t + first_star_potion + 
##     Action + Drama + Comedy + I_superhero + season, family = "binomial", 
##     data = train_set)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.66461  -1.08989   0.08462   1.12078   1.66360  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.019586   0.215134  -0.091 0.927461    
## s_production        0.003855   0.001018   3.787 0.000152 ***
## d_score             0.006750   0.001747   3.864 0.000111 ***
## a_score_t          -0.003447   0.001270  -2.714 0.006641 ** 
## first_star_potion   0.643611   0.317169   2.029 0.042434 *  
## ActionTRUE         -0.405924   0.225784  -1.798 0.072202 .  
## DramaTRUE          -0.331243   0.165653  -2.000 0.045540 *  
## ComedyTRUE         -0.345097   0.165731  -2.082 0.037318 *  
## I_superheroTRUE   -14.136436 599.427837  -0.024 0.981185    
## season              0.058922   0.162390   0.363 0.716724    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1006.4  on 725  degrees of freedom
## Residual deviance:  951.1  on 716  degrees of freedom
## AIC: 971.1
## 
## Number of Fisher Scoring iterations: 13
f_hat2 = predict(final, test_set, type = "response")
pred2=data.frame(test_set,f_hat2) %>%
  mutate(pred=round(f_hat2)) %>% 
  mutate(accurate=ifelse(pred==p_vs_b,1,0))%>%
  filter(!is.na(pred))

sum(pred1$accurate)/nrow(pred1)
## [1] 0.575
sum(pred2$accurate)/nrow(pred2)
## [1] 0.6125

Prediction accuracy on the held-out 10% of films is poor (57.5% for the full model and 61.3% for the reduced model), and therefore we abandoned logistic regression and stuck with our linear regression models.
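
Because success is defined by a median split, roughly half of the films fall in each class, so a rule that always guesses the same class is right about half the time. The sketch below shows one way to put a model's accuracy next to that baseline. It runs on simulated data, and every name in it (`sim`, `score`, `fit_glm`, `baseline`) is an assumption rather than part of the film analysis.

```r
# Logistic regression accuracy versus a majority-class baseline.
# Simulated data; all names are illustrative assumptions.
set.seed(3)
n <- 800
sim <- data.frame(score = rnorm(n))
ratio <- 0.5 * sim$score + rnorm(n)               # stand-in for profit/budget
sim$success <- as.integer(ratio > median(ratio))  # median split, as above

test_idx <- sample(n, size = round(0.1 * n))      # hold out 10% for testing
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit_glm <- glm(success ~ score, data = train, family = binomial)
p_hat <- predict(fit_glm, newdata = test, type = "response")
pred  <- as.integer(p_hat > 0.5)

accuracy <- mean(pred == test$success)
# Baseline: always predict the more common class in the training set
majority <- as.integer(mean(train$success) >= 0.5)
baseline <- mean(test$success == majority)
c(model = accuracy, baseline = baseline)
```

With a test set of only a few dozen films, a gap of several percentage points over the baseline can easily be noise, so the two numbers are best read together with the test set size.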


Predictions

Building regression tree

Now, we apply what we learned about the important factors determining a movie’s success to films of 2016. To build the prediction trees, we used all the predictors and did not discriminate between the budget categories. This approach has a disadvantage: stratifying by a film’s budget would have allowed us to make better predictions. However, in many cases the budget of an upcoming film is well guarded before its release. As shown in the previous section, actors’ scores capture some of the budget information and can stand in for budget as a predictor. Nonetheless, we have selected 2016 films that have already been released, or future films whose budget was available online.

require(lubridate)
require(tree)
## Loading required package: tree
require(gridExtra)
theme_set(theme_bw(base_size = 16))
require(rpart)
## Loading required package: rpart
#too large to knit we need to save and import it 
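# season = 1 for an April to August release, 0 otherwise
data<-data_checkpoint1%>%mutate(season=ifelse(month(date)>=4 & month(date)<=8,1,0))
# shift before taking logs so that films with a loss (negative profit) stay defined
data<-data%>%mutate(profit=log10(profit+300000000))
data<-data%>%mutate(budget=log10(budget+100))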
dat<-data%>%select(profit,title,a_score_t,first_star_potion,runtime,budget,year,Action,Adventure,Animation,Comedy,Crime,Drama,Romance,Thriller,USA,s_production,d_score,I_saving_world,I_superhero,season)

dat_saved<-dat%>%select(-title)
# tree fit
fit <- tree(profit~., data = dat_saved)
plot(fit)
text(fit, cex = 0.8)

# cross validation to optimize tree
fit_1 <- tree(profit~., data = as.data.frame(as.matrix(dat_saved)),
            control = tree.control(nobs = nrow(dat_saved), 
                                   mincut = 1, minsize = 2, mindev = 0.001))

cv_polls <- cv.tree(fit_1)
data_frame(tree_size = cv_polls$size, RSS = cv_polls$dev) %>% 
  filter(tree_size>1 & tree_size < 20) %>%
  ggplot(aes(tree_size, RSS)) + geom_point()

#pruned_fit <- prune.tree(fit)
pruned_fit <- prune.tree(fit_1, best=10)

plot(pruned_fit)
text(pruned_fit, cex = 0.8)

#testing predictions using tree 

require(caret)
set.seed(1)

inTrain <- createDataPartition(y = dat$profit, p=0.9) # Leave out 10% data for later testing
train_set <- slice(dat, inTrain$Resample1)
test_set <- slice(dat, -inTrain$Resample1)
fit <- tree(profit~., data = select(train_set,-title),
            control = tree.control(nobs = nrow(train_set), 
                                   mincut = 1, minsize = 2, mindev = 0.001))

pruned_fit <- prune.tree(fit,best=10)
plot(pruned_fit)
text(pruned_fit, cex = 0.8)

pred <- predict(fit,newdata = select(test_set,-title))


t<-data.frame(predict=pred,true=test_set$profit,title=test_set$title)
t1<-t%>%filter(true>20.5)
ggplot(aes(x=predict,y=true),data=t)+geom_point()+
  geom_abline(intercept = 0, slope = 1,col=2)

RMSE<-postResample(pred,test_set$profit)
RMSE
##      RMSE  Rsquared 
## 0.1022406 0.3395796
#NRMSE
RMSE[1]/(max(t$true)-min(t$true))
##      RMSE 
## 0.1718808
RMSE[1]/mean(t$true)
##      RMSE 
## 0.0119535
#cv.tree(fit)

As we can see, the tree explains about 34% of the variance in (log) profit on the held-out films (Rsquared = 0.34). It may be more useful to rank the films qualitatively by their predicted profit.
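
The error measures above can be reproduced by hand. As far as we are aware, `caret::postResample` reports `Rsquared` as the squared correlation between predictions and observations, and the sketch below makes that assumption. The vectors `obs` and `pred_sim` are simulated stand-ins, not the film data.

```r
# RMSE, squared-correlation R^2 and two normalised RMSEs in base R.
# obs and pred_sim are simulated stand-ins for true and predicted log profit.
set.seed(4)
obs <- rnorm(100, mean = 8.5, sd = 0.15)
pred_sim <- obs + rnorm(100, sd = 0.1)

rmse <- sqrt(mean((pred_sim - obs)^2))
r2   <- cor(pred_sim, obs)^2                  # assumed to match caret's Rsquared
nrmse_range <- rmse / (max(obs) - min(obs))   # RMSE relative to the observed range
nrmse_mean  <- rmse / mean(obs)               # RMSE relative to the observed mean
c(RMSE = rmse, Rsquared = r2, NRMSE_range = nrmse_range, NRMSE_mean = nrmse_mean)
```

The range-normalised value is the stricter of the two here because log profit spans a narrow range around a large mean, which mirrors the gap between the two NRMSE figures reported above.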

Using regression trees to predict the rank of upcoming films

#fit <- tree(profit~., data = as.data.frame(as.matrix(select(dat,-title))),control = tree.control(nobs = nrow(dat), mincut = 1, minsize = 2, mindev = 0.001))
upcoming_movies <- read.csv("upcoming_movies.csv")
data<-upcoming_movies
data<-data%>%mutate(director=as.character(director),star1=as.character(star1),star2=as.character(star2),star3=as.character(star3),star4=as.character(star4),star5=as.character(star5))
score_director<-score_director%>%mutate(director=as.character(director))
data<-left_join(data,score_director,by.x="director",by.y="director")
## Joining by: "director"
t<-data%>%select(title,star1:star5)%>%left_join(score_actors,by=c("star1"="name"))
colnames(t)<-c("title",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1")
t<-t%>%left_join(score_actors,by=c("star2"="name"))
colnames(t)<-c("title",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2")
t<-t%>%left_join(score_actors,by=c("star3"="name"))

colnames(t)<-c("title",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2","a_score3")
t<-t%>%left_join(score_actors,by=c("star4"="name"))
colnames(t)<-c("title",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2","a_score3","a_score4")
t<-t%>%left_join(score_actors,by=c("star5"="name"))
colnames(t)<-c("title",  "star1",   "star2",   "star3",   "star4",  "star5",   "a_score1","a_score2","a_score3","a_score4","a_score5")
t<-t%>%mutate(a_score1=ifelse(is.na(a_score1),0,a_score1),a_score2=ifelse(is.na(a_score2),0,a_score2),a_score3=ifelse(is.na(a_score3),0,a_score3),a_score4=ifelse(is.na(a_score4),0,a_score4),a_score5=ifelse(is.na(a_score5),0,a_score5))

t<-t%>%mutate(first_star_potion=a_score1/(a_score1+a_score2+a_score3+a_score4+a_score5))
# 0/0 gives NaN (not Inf) when none of the five stars has a score
t<-t%>%mutate(first_star_potion=ifelse(is.finite(first_star_potion),first_star_potion,0))
#dat_star<-data%>%select(TMDBID,budget)%>%right_join(t,by="TMDBID")
t<-t%>%mutate(a_score_t=(0.4*a_score1+0.30*a_score2+0.20*a_score3+0.05*a_score4+0.05*a_score5))


data<-t%>%left_join(data,by='title')
data = data %>%mutate(date=parse_date_time(releaseDate,"mdy"))
data<-data%>%mutate(season=ifelse(month(data$date)>=4&month(data$date)<=8,1,0))
data<-data%>%left_join(select(data_checkpoint1,17,44))
## Joining by: "production"
data[is.na(data)]=0
data<-unique(data)
data<-data%>%mutate(USA=1)
dat<-data%>%select(title,a_score_t,first_star_potion,runtime,budget,year,Action,Adventure,Animation,Comedy,Crime,Drama,Romance,Thriller,USA,s_production,d_score,I_saving_world,I_superhero,season)



t<- as.matrix(dat[c(-1)])
t<-as.data.frame(t)

pred <- predict(fit_1,newdata = t)
#plot(pruned_fit)
#text(pruned_fit, cex = 0.8)

t<-data.frame(predict=pred,title=dat$title)
# invert log10(profit + 300000000) to get back to dollars
t<-t%>%mutate(predict=10^(predict)-300000000)
t<-t[order(-t$predict),]
t<-unique(t)
t<-t%>%mutate(rank=rank(-predict,ties.method="first"))
t<-t[order(t$rank),]
t%>%select(title,rank)%>%kable
| title | rank |
|:---|---:|
| Ghostbusters | 1 |
| X-Men: Apocalypse | 2 |
| Zootopia | 3 |
| Hail, Caesar! | 4 |
| Zoolander 2 | 5 |
| Jane got a gun | 6 |
| Grimsby | 7 |
| Dirty Grandpa | 8 |
| Misconduct | 9 |
| Whiskey Tango Foxtort | 10 |
| The Boss | 11 |

Among the 11 films in our “upcoming movies” list, we predict that Ghostbusters, X-Men: Apocalypse and Zootopia will take the top three spots. In reality, Zootopia’s box office revenue is close to $1 billion, and it would be fairly challenging for the others to catch up to it.
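
The trees are fit to `log10(profit + 300000000)`, so a prediction has to be mapped back with the inverse, `10^x - 300000000`, before it can be read as dollars. The ranking itself does not depend on the back-transformation, because the transformation is monotone. A small sketch with made-up profit values follows; the shift of 3e8 is taken from the code above, while `to_log`, `from_log` and the numbers are hypothetical.

```r
# Shifted log transform and its inverse; the profit values below are made up.
shift <- 3e8  # same shift as in the analysis code above
to_log   <- function(profit) log10(profit + shift)
from_log <- function(x) 10^x - shift

profit <- c(-1.2e8, 5e6, 2.5e8, 9e8)  # losses are fine as long as profit > -shift
x <- to_log(profit)
back <- from_log(x)

# Ranking on the log scale and on the dollar scale gives the same order
rank_log    <- rank(-x)
rank_dollar <- rank(-back)
data.frame(profit, log_scale = x, recovered = back, rank = rank_log)
```

This is why the rank table is a safer summary than the dollar predictions: an error in the back-transformation changes the predicted amounts but leaves the order of the films unchanged.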


Conclusion

Overall, the success of movies can be challenging to predict. Our data analysis brings out many interesting trends in the movie landscape. Our key finding is that production companies should pay attention to different sets of movie features for different budget categories in order to finance a profitable film. Certain features such as director choice and high production value heavily influence the profitability of a film.

If you want to direct a profitable film next year, check out our actor, director and production company scores, as well as our regression model results from the 3 budget categories.

Lastly, let us look at some of our outliers that completely crushed our prediction models.

#Movies made much more profit (log scale) than others 
outlier %>%slice(1:4)%>% kable()
| title | budget | profit | p_vs_b |
|:---|---:|---:|---:|
| Avatar | 19.28357 | 20.62397 | 1.0695099 |
| Star Wars: The Force Awakens | 19.11383 | 20.62397 | 1.0790078 |
| Titanic | 19.11383 | 20.62397 | 1.0790078 |
| The Lone Ranger | 19.35677 | 18.71551 | 0.9668714 |
#Movies made high profit vs budget ratio
outlier %>%slice(5:22)%>% kable()
| title | budget | profit | p_vs_b |
|:---|---:|---:|---:|
| Clerks | 27000 | 3124130 | 115.70852 |
| The Full Monty | 3500000 | 254350122 | 72.67146 |
| Pi | 60000 | 3161152 | 52.68587 |
| Lost & Found | 1 | 99 | 99.00000 |
| The Blair Witch Project | 25000 | 247975000 | 9919.00000 |
| My Big Fat Greek Wedding | 5000000 | 363744044 | 72.74881 |
| Napoleon Dynamite | 400000 | 45718097 | 114.29524 |
| Super Size Me | 65000 | 28510078 | 438.61658 |
| Primer | 7000 | 417760 | 59.68000 |
| Saw | 1200000 | 102711669 | 85.59306 |
| Open Water | 130000 | 54537954 | 419.52272 |
| Facing the Giants | 100000 | 10078331 | 100.78331 |
| Once | 160000 | 20550513 | 128.44071 |
| Paranormal Activity | 15000 | 193340800 | 12889.38667 |
| Catfish | 30000 | 3015943 | 100.53143 |
| Paranormal Activity 2 | 3000000 | 174512032 | 58.17068 |
| From Prada to Nada | 93 | 2499907 | 26880.72043 |
| Insidious | 1500000 | 95509150 | 63.67277 |

The first table shows 4 films: 3 with massive profits and the last with a massive loss. The second table shows films with a tremendous profit to budget ratio. Many of the low budget films that made “comparatively massive” yet “overall humble” profits are captured in this table.